AutoHet: Automated 3D Parallelism System
- AutoHet is an automated system for planning and executing 3D parallelism on heterogeneous GPU clusters using fine-grained, asymmetric configurations.
- It employs multi-stage optimization, including profiling, device grouping, and pipeline load balancing to minimize per-iteration training time.
- The system enables efficient spot-instance recovery via layered, locality-aware checkpointing and adaptive shard assembly for improved throughput.
AutoHet is an automated system for planning and executing 3D parallelism when training large neural network models—especially LLMs—on GPU clusters composed of heterogeneous hardware and spot instances. AutoHet generates fine-grained, asymmetric parallel configurations that minimize per-iteration training time and enable efficient, elastic recovery from spot instance preemptions. The system incorporates optimization-based device and stage grouping, layered profiling, and locality-aware checkpointing to overcome challenges unique to heterogeneous clusters, such as device straggling, gradient synchronization overheads, and memory versus computational trade-offs (Wang et al., 24 Dec 2025).
1. Motivation and Challenges in Heterogeneous 3D Parallelism
AutoHet is motivated by the increasing prevalence of heterogeneous fleets in large-scale model training, where clusters contain multiple generations of GPUs (e.g., A100, H800, H20) and the set of available devices fluctuates due to spot instance utilization. Conventional frameworks (Megatron-LM, DeepSpeed, Whale, Alpa) assume homogeneous clusters and enforce symmetric 3D partitioning (Data Parallelism—DP, Tensor Parallelism—TP, Pipeline Parallelism—PP). In contrast, heterogeneity introduces:
- Straggler effects in DP when uniform batch sizes are imposed on devices of unequal speed.
- Incompatible TP dimensions that require extra transpose and buffer allocations.
- Pipeline stages with misaligned layer counts, resulting in complex gradient synchronization requirements.
Empirical data show that asymmetric TP incurs throughput drops of up to 49% for 10B-parameter models due to costly transposes during AllReduce operations. Effective exploitation of heterogeneous clusters therefore demands a uniform TP dimension across DP groups, flexible PP mappings, and novel workload-balancing strategies that jointly optimize memory and compute utilization.
2. System Architecture and Workflow
AutoHet’s workflow comprises four sequential phases:
- Profiling: The system measures per-layer forward and backward times and memory footprints (parameters, activations, optimizer state) for each GPU type and candidate TP dimension. This yields per-layer time and memory cost functions, indexed by GPU type and TP dimension, that drive the later planning stages.
- 3D Parallel Planning: This proceeds in two stages.
- Device Grouping & DP Count: A mixed-integer nonlinear program assigns each GPU to exactly one DP group, maximizing the minimum effective computing power per group. A GPU's effective computing power is its GFLOPS scaled by (1 − β), where β is its group's pipeline bubble ratio.
- GPU Node & Stage Mapping plus PP Load Balancing: A secondary optimization divides model layers across PP stages to minimize the execution time of the slowest stage, while maintaining memory feasibility (each GPU's parameter, activation, and optimizer-state footprint stays within its capacity).
- Profiling-Accelerated Cost Estimation: Candidate parallelism plans are rapidly assessed using profiled layer costs, communication bandwidths, and pipeline bubble overlaps.
- Plan Selection and Training Launch: The plan with the lowest estimated per-iteration time is chosen after lightweight profiling-based sampling, and training is launched with this configuration.
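The role of the profiled cost functions in the workflow above can be sketched as a lookup table plus a simple stage-cost estimator. This is an illustrative sketch, not AutoHet's actual API: `ProfileEntry`, `stage_time_ms`, and all timing and memory numbers are hypothetical placeholders.

```python
# Hypothetical sketch of profiling tables and per-stage cost estimation;
# names and numbers are illustrative, not AutoHet's real interface or data.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProfileEntry:
    fwd_ms: float   # forward time of one layer
    bwd_ms: float   # backward time of one layer
    mem_mb: float   # parameters + activations + optimizer state per layer

# Profiles keyed by (gpu_type, tp_dim): one entry per transformer layer.
profiles = {
    ("A100", 2): ProfileEntry(fwd_ms=3.0, bwd_ms=6.0, mem_mb=900.0),
    ("H800", 2): ProfileEntry(fwd_ms=2.0, bwd_ms=4.0, mem_mb=900.0),
}

def stage_time_ms(gpu_type: str, tp_dim: int, n_layers: int) -> float:
    """Compute time of one pipeline stage = sum of its layers' fwd+bwd."""
    e = profiles[(gpu_type, tp_dim)]
    return n_layers * (e.fwd_ms + e.bwd_ms)

def stage_mem_mb(gpu_type: str, tp_dim: int, n_layers: int) -> float:
    """Memory footprint of a stage, used for feasibility checks."""
    return n_layers * profiles[(gpu_type, tp_dim)].mem_mb

# Example: a 2-stage pipeline with 8 layers on each GPU type.
print(stage_time_ms("H800", 2, 8), stage_time_ms("A100", 2, 8))  # -> 48.0 72.0
```

Because stage costs are linear sums of profiled layer costs, candidate plans can be scored without running the model, which is what makes the cost-estimation phase fast.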
3. Theoretical Optimization Formulation
The AutoHet design formalizes device grouping and PP load balancing as discrete optimization problems:
Device-Grouping Problem (Stage 1):
- Variables: a binary GPU-to-group assignment, an indicator marking each non-empty group, and a scalar tracking the minimum effective compute across groups.
- Objective: maximize the minimum effective compute per group, thereby sustaining as many DP groups as possible at high effective compute.
- Subject to constraints enforcing memory lower bounds, group indicator validity, and single GPU-group mapping.
- Solution via SCIP achieves optimal assignments for up to 64 GPUs in 1–160 s.
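For intuition, the Stage-1 objective can be illustrated by a tiny brute-force analogue (the paper solves a MINLP with SCIP; this exhaustive sketch only demonstrates maximizing the minimum effective compute per group). GFLOPS figures and bubble ratios below are assumed values, and fixing the bubble ratio per group is a simplification.

```python
# Brute-force toy analogue of the device-grouping objective; illustrative
# only -- the real system uses a MINLP solved by SCIP, not enumeration.
from itertools import product

gpus_gflops = [312.0, 312.0, 989.0, 989.0]  # e.g. 2x A100, 2x H800 (assumed)
bubble_ratio = {0: 0.1, 1: 0.1}             # assumed per-group bubble ratios
n_groups = 2

def min_effective_compute(assignment):
    """Group effective compute = sum of GFLOPS * (1 - bubble ratio)."""
    totals = [0.0] * n_groups
    for gpu, grp in enumerate(assignment):
        totals[grp] += gpus_gflops[gpu] * (1.0 - bubble_ratio[grp])
    if any(t == 0.0 for t in totals):   # every DP group must be non-empty
        return float("-inf")
    return min(totals)

best = max(product(range(n_groups), repeat=len(gpus_gflops)),
           key=min_effective_compute)
print(best)  # a balanced split pairing one fast GPU with one slow GPU per group
```

The optimum pairs one fast and one slow GPU per group, which mirrors the paper's goal of equalizing effective compute across DP groups.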
PP Load-Balancing Problem (Stage 2):
- Minimize the slowest stage's execution time (and hence the pipeline bubble) under a fixed total layer count and per-GPU memory constraints.
- Solved by direct enumeration for small stage counts or by greedy heuristics at scale.
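The enumeration variant of Stage 2 can be sketched directly: choose how many layers each pipeline stage receives so that the slowest stage is as fast as possible, subject to per-GPU memory capacity. Per-layer times, memory figures, and capacities below are assumptions for illustration.

```python
# Sketch of PP load balancing by direct enumeration (2 stages for brevity).
# All per-layer times, memory sizes, and capacities are illustrative.
n_layers = 24
layer_time = {"H800": 6.0, "A100": 9.0}       # assumed ms per layer (fwd+bwd)
layer_mem = 900.0                             # assumed MB per layer
stages = ["H800", "A100"]                     # stage index -> GPU type
mem_cap = {"H800": 14000.0, "A100": 14000.0}  # assumed MB free per GPU

def split_costs(counts):
    """Max stage time for a layers-per-stage tuple, or inf if out of memory."""
    times = []
    for gpu, c in zip(stages, counts):
        if c * layer_mem > mem_cap[gpu]:
            return float("inf")
        times.append(c * layer_time[gpu])
    return max(times)

# For 2 stages the splits are just (k, n_layers - k); enumerate them all.
best = min(((k, n_layers - k) for k in range(1, n_layers)), key=split_costs)
print(best, split_costs(best))  # -> (14, 10) 90.0
```

Note that the slower GPU type ends up with fewer layers, which is exactly the asymmetric stage mapping described in Section 4.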
This optimization framing lets AutoHet balance partitioning granularity against hardware-specific memory and compute constraints.
4. Asymmetric Parallelism and Scheduling Mechanisms
AutoHet enforces symmetric TP group sizes across DP groups to prevent the high overheads associated with gradient reshaping during DP synchronization. Valid TP dimensions are automatically enumerated such that each node features an integer multiple of a candidate TP group. Asymmetric pipeline parallelism is accommodated: DP groups may possess different PP depths, prompting layer-wise gradient synchronization via multiple ring AllReduce operations.
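The TP enumeration rule above can be sketched in a few lines: a TP width t is admissible only if every node's GPU count is an integer multiple of t, so TP groups never span nodes and all DP groups share one TP width. The function name and node sizes are illustrative.

```python
# Sketch of valid TP-dimension enumeration: each node's GPU count must be
# an integer multiple of the candidate TP group size. Illustrative only.
def valid_tp_dims(gpus_per_node):
    """Return all TP widths that evenly divide every node's GPU count."""
    max_t = min(gpus_per_node)
    return [t for t in range(1, max_t + 1)
            if all(n % t == 0 for n in gpus_per_node)]

print(valid_tp_dims([8, 8, 4]))  # -> [1, 2, 4]
```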
Workload splitting is governed by the pipeline bubble ratio (derived from profiled stage times and micro-batch count with a 1F1B schedule). PP stage mapping assigns fewer consecutive layers to slower GPUs, ensuring memory-heavy and communication-light early PP stages run on less performant devices. NVLink channel priority is reserved first for TP traffic, then DP, then PP, optimizing performance across the interconnect topology.
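The bubble ratio driving this splitting can be approximated with a standard first-order 1F1B model: iteration time ≈ (m − 1) · max(stage times) + Σ(stage times) for m micro-batches, with the bubble as the idle fraction relative to keeping the bottleneck stage busy. This is a common approximation, not AutoHet's exact cost model.

```python
# First-order 1F1B bubble-ratio estimate; an approximation for intuition,
# not AutoHet's exact profiled cost model.
def bubble_ratio(stage_times, m):
    """Idle fraction of a 1F1B pipeline with m micro-batches."""
    iter_time = (m - 1) * max(stage_times) + sum(stage_times)
    ideal = m * max(stage_times)  # bottleneck stage busy the whole time
    return 1.0 - ideal / iter_time

# Homogeneous 4-stage pipeline, 8 micro-batches: bubble = 3/11.
print(round(bubble_ratio([1.0, 1.0, 1.0, 1.0], m=8), 3))  # -> 0.273
```

For p equal stages this reduces to the familiar (p − 1)/(m + p − 1), so the bubble shrinks as the micro-batch count grows.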
5. Elastic Recovery and Checkpointing for Spot Instances
AutoHet enables rapid, locality-aware recovery following spot instance preemption via layered checkpointing and adaptive shard loading:
- Layer-wise checkpointing saves model parameters and optimizer states keyed by layer and shard rather than by device.
- A layer bitmap tracks which shards reside locally versus in the cloud.
- Recovery follows three patterns:
- TP dimension unchanged: surviving GPUs load their local shards.
- TP dimension increased: new larger TP shards are split from original local full-shards.
- TP dimension decreased: multiple previous shards are concatenated for smaller TP groups.
During recovery, shards are assembled from local SSDs or via RDMA (400 Gb/s), only falling back to cloud storage (1.2 GB/s) for missing shards. This locality-driven protocol yields up to 4.38× faster recovery compared to full cloud restore baselines.
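The three recovery patterns share one underlying operation: concatenate a layer's saved shards back into the full tensor, then re-split to the new TP width. A minimal sketch, modeling shards as lists of rows (real checkpoints hold framework tensors; `reshard` is a hypothetical helper, not AutoHet's API):

```python
# Sketch of adaptive shard assembly on TP-dimension change; shards are
# modeled as lists of rows. "reshard" is an illustrative helper name.
def reshard(shards, new_tp):
    """Re-split a layer's TP shards into new_tp equal pieces (axis 0)."""
    full = [row for shard in shards for row in shard]  # reassemble the layer
    assert len(full) % new_tp == 0, "new TP width must divide the tensor dim"
    k = len(full) // new_tp
    return [full[i * k:(i + 1) * k] for i in range(new_tp)]

old = reshard([list(range(8))], 2)   # full layer -> TP=2 shards
grown = reshard(old, 4)              # TP increased: split local shards finer
shrunk = reshard(grown, 2)           # TP decreased: concatenate shard pairs
print(grown)   # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
print(shrunk)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the full tensor is rebuilt from whichever shards are nearest (local SSD, then RDMA peers, then cloud), the same split/concatenate logic serves all three TP-change cases.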
6. Empirical Performance Evaluation
On clusters mixing NVIDIA A100 (80 GB), H800 (80 GB), and H20 (100 GB) GPUs (32 total, grouped across four nodes), AutoHet was evaluated using BERT-Large (340M), GPT-3 (6.7B), and LLaMA (6.7–70B) models.
- Uniform GPU distribution:
- BERT-Large: 1.38× higher throughput than Megatron-LM.
- GPT-3: 1.53×/1.27× faster than Megatron-LM/Whale.
- Non-uniform GPU distribution:
- LLaMA 6.7B: up to 1.79×/1.51× faster than Megatron-LM/Whale for A100 + H800 mixes.
- A100 + H20 mixes yield 1.44×/1.16× improvements.
- Breakdown (GPT-3, 4 A100 + 4 H800):
- Device grouping: 1.11×.
- Node/stage mapping: 1.16×.
- PP workload balancing: 1.79× overall.
SCIP solves the device-grouping programs in 1.23–159 s for up to 64 GPUs. Profiling over powers-of-two layer blocks finishes in 11.9–15.4 minutes, roughly ten times faster than Alpa's approach.
Recovery performance:
- Full DP group reclaimed: 4.38× faster than cloud baseline (Varuna).
- Partial DP reclaimed: 1.49× speedup.
- New GPUs added: 3.59× speedup via full RDMA-only restore.
7. Design Guidelines and Implications
AutoHet’s empirical and theoretical results support the following design implications for large-scale LLM training on heterogeneous, elastic clusters:
- Always enforce symmetric TP across DP groups to minimize gradient reshaping overhead.
- Permit asymmetric PP by assigning fewer layers to slower GPUs and synchronizing gradients layer-wise.
- Prioritize NVLink routing—TP, then DP, then PP—to maximize utilization of high-speed interconnects.
- Decompose planning into global (DP grouping) and local (PP balancing) optimization problems solvable by standard tools.
- Accelerate profiling by composing powers-of-two layer probes linearly, minimizing turnaround.
- Checkpoint at layer granularity and use a layer bitmap for locality-driven recovery preference (local SSD and RDMA over cloud).
- Maintain lightweight elasticity: re-plan training in seconds, recover in seconds to minutes, preserving time savings from spot-instance usage.
The system demonstrates that formal, automated asymmetric 3D parallelism planning—coupled with locality-centric checkpointing—enables practitioners to leverage heterogeneous spot-instance GPU clusters with minimal intervention and with near-optimal efficiency (Wang et al., 24 Dec 2025).