AutoHet: Automated 3D Parallelism System
- AutoHet is an automated system for planning and executing 3D parallelism on heterogeneous GPU clusters using fine-grained, asymmetric configurations.
- It employs multi-stage optimization, including profiling, device grouping, and pipeline load balancing to minimize per-iteration training time.
- The system enables efficient spot-instance recovery via layered, locality-aware checkpointing and adaptive shard assembly for improved throughput.
AutoHet is an automated system for planning and executing 3D parallelism when training large neural network models—especially LLMs—on GPU clusters composed of heterogeneous hardware and spot instances. AutoHet generates fine-grained, asymmetric parallel configurations that minimize per-iteration training time and enable efficient, elastic recovery from spot instance preemptions. The system incorporates optimization-based device and stage grouping, layered profiling, and locality-aware checkpointing to overcome challenges unique to heterogeneous clusters, such as device straggling, gradient synchronization overheads, and memory versus computational trade-offs (Wang et al., 24 Dec 2025).
1. Motivation and Challenges in Heterogeneous 3D Parallelism
AutoHet is motivated by the increasing prevalence of heterogeneous fleets in large-scale model training, where clusters contain multiple generations of GPUs (e.g., A100, H800, H20) and the set of available devices fluctuates due to spot instance utilization. Conventional frameworks (Megatron-LM, DeepSpeed, Whale, Alpa) assume homogeneous clusters and enforce symmetric 3D partitioning (Data Parallelism—DP, Tensor Parallelism—TP, Pipeline Parallelism—PP). In contrast, heterogeneity introduces:
- Straggler effects in DP when uniform batch sizes are imposed on devices of unequal speed.
- Incompatible TP dimensions that require extra transpose and buffer allocations.
- Pipeline stages with misaligned layer counts, resulting in complex gradient synchronization requirements.
Empirical data show that asymmetric TP incurs throughput drops of up to 49% for 10B-parameter models due to costly transposes during AllReduce operations. Effective exploitation of heterogeneous clusters therefore demands a uniform TP dimension across DP groups, flexible PP mappings, and novel workload-balancing strategies that jointly optimize memory and compute utilization.
2. System Architecture and Workflow
AutoHet’s workflow comprises four sequential phases:
- Profiling: The system measures per-layer forward and backward times and memory footprints (parameters, activations, optimizer state) for each GPU type and candidate TP dimension. This yields per-layer time and memory cost functions, indexed by GPU type and TP dimension, that drive the later planning stages.
- 3D Parallel Planning: This proceeds in two stages.
- Device Grouping & DP Count: A mixed-integer nonlinear program assigns each GPU to exactly one DP group, maximizing the minimum effective computing power per group. A GPU's effective computing power is its GFLOPS scaled by (1 − β), where β is its group's pipeline bubble ratio.
- GPU Node & Stage Mapping plus PP Load Balancing: A secondary optimization divides model layers across PP stages to minimize the execution time of the slowest stage, while maintaining memory feasibility (each GPU's parameter, activation, and optimizer-state footprint stays within its capacity).
- Profiling-Accelerated Cost Estimation: Candidate parallelism plans are rapidly assessed using profiled layer costs, communication bandwidths, and pipeline bubble overlaps.
- Plan Selection and Training Launch: The plan with the lowest estimated per-iteration time is chosen after lightweight profiling-based sampling, and training is launched with this configuration.
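The role of the profiled cost functions in the workflow above can be sketched as a lookup table plus a simple stage-cost estimator. This is an illustrative sketch, not AutoHet's actual API: `ProfileEntry`, `stage_time_ms`, and all timing and memory numbers are hypothetical placeholders.

```python
# Hypothetical sketch of profiling tables and per-stage cost estimation;
# names and numbers are illustrative, not AutoHet's real interface or data.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProfileEntry:
    fwd_ms: float   # forward time of one layer
    bwd_ms: float   # backward time of one layer
    mem_mb: float   # parameters + activations + optimizer state per layer

# Profiles keyed by (gpu_type, tp_dim): one entry per transformer layer.
profiles = {
    ("A100", 2): ProfileEntry(fwd_ms=3.0, bwd_ms=6.0, mem_mb=900.0),
    ("H800", 2): ProfileEntry(fwd_ms=2.0, bwd_ms=4.0, mem_mb=900.0),
}

def stage_time_ms(gpu_type: str, tp_dim: int, n_layers: int) -> float:
    """Compute time of one pipeline stage = sum of its layers' fwd+bwd."""
    e = profiles[(gpu_type, tp_dim)]
    return n_layers * (e.fwd_ms + e.bwd_ms)

def stage_mem_mb(gpu_type: str, tp_dim: int, n_layers: int) -> float:
    """Memory footprint of a stage, used for feasibility checks."""
    return n_layers * profiles[(gpu_type, tp_dim)].mem_mb

# Example: a 2-stage pipeline with 8 layers on each GPU type.
print(stage_time_ms("H800", 2, 8), stage_time_ms("A100", 2, 8))  # -> 48.0 72.0
```

Because stage costs are linear sums of profiled layer costs, candidate plans can be scored without running the model, which is what makes the cost-estimation phase fast.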
3. Theoretical Optimization Formulation
The AutoHet design formalizes device grouping and PP load balancing as discrete optimization problems:
Device-Grouping Problem (Stage 1):
- Variables: a binary GPU-to-group assignment, an indicator marking each non-empty group, and a scalar tracking the minimum effective compute across groups.
- Objective: maximize the minimum effective compute per group, thereby sustaining as many DP groups as possible at high effective compute.
- Subject to constraints enforcing memory lower bounds, group indicator validity, and single GPU-group mapping.
- Solution via SCIP achieves optimal assignments for up to 64 GPUs in 1–160 s.
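For intuition, the Stage-1 objective can be illustrated by a tiny brute-force analogue (the paper solves a MINLP with SCIP; this exhaustive sketch only demonstrates maximizing the minimum effective compute per group). GFLOPS figures and bubble ratios below are assumed values, and fixing the bubble ratio per group is a simplification.

```python
# Brute-force toy analogue of the device-grouping objective; illustrative
# only -- the real system uses a MINLP solved by SCIP, not enumeration.
from itertools import product

gpus_gflops = [312.0, 312.0, 989.0, 989.0]  # e.g. 2x A100, 2x H800 (assumed)
bubble_ratio = {0: 0.1, 1: 0.1}             # assumed per-group bubble ratios
n_groups = 2

def min_effective_compute(assignment):
    """Group effective compute = sum of GFLOPS * (1 - bubble ratio)."""
    totals = [0.0] * n_groups
    for gpu, grp in enumerate(assignment):
        totals[grp] += gpus_gflops[gpu] * (1.0 - bubble_ratio[grp])
    if any(t == 0.0 for t in totals):   # every DP group must be non-empty
        return float("-inf")
    return min(totals)

best = max(product(range(n_groups), repeat=len(gpus_gflops)),
           key=min_effective_compute)
print(best)  # a balanced split pairing one fast GPU with one slow GPU per group
```

The optimum pairs one fast and one slow GPU per group, which mirrors the paper's goal of equalizing effective compute across DP groups.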
PP Load-Balancing Problem (Stage 2):
- Minimize the slowest stage's execution time (and hence the pipeline bubble) under a fixed total layer count and per-GPU memory constraints.
- Solved by direct enumeration for small stage counts or by greedy heuristics at scale.
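The enumeration variant of Stage 2 can be sketched directly: choose how many layers each pipeline stage receives so that the slowest stage is as fast as possible, subject to per-GPU memory capacity. Per-layer times, memory figures, and capacities below are assumptions for illustration.

```python
# Sketch of PP load balancing by direct enumeration (2 stages for brevity).
# All per-layer times, memory sizes, and capacities are illustrative.
n_layers = 24
layer_time = {"H800": 6.0, "A100": 9.0}       # assumed ms per layer (fwd+bwd)
layer_mem = 900.0                             # assumed MB per layer
stages = ["H800", "A100"]                     # stage index -> GPU type
mem_cap = {"H800": 14000.0, "A100": 14000.0}  # assumed MB free per GPU

def split_costs(counts):
    """Max stage time for a layers-per-stage tuple, or inf if out of memory."""
    times = []
    for gpu, c in zip(stages, counts):
        if c * layer_mem > mem_cap[gpu]:
            return float("inf")
        times.append(c * layer_time[gpu])
    return max(times)

# For 2 stages the splits are just (k, n_layers - k); enumerate them all.
best = min(((k, n_layers - k) for k in range(1, n_layers)), key=split_costs)
print(best, split_costs(best))  # -> (14, 10) 90.0
```

Note that the slower GPU type ends up with fewer layers, which is exactly the asymmetric stage mapping described in Section 4.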
This optimization framing lets AutoHet balance partitioning granularity against hardware-specific memory and compute constraints.
4. Asymmetric Parallelism and Scheduling Mechanisms
AutoHet enforces symmetric TP group sizes across DP groups to prevent the high overheads associated with gradient reshaping during DP synchronization. Valid TP dimensions are automatically enumerated such that each node features an integer multiple of a candidate TP group. Asymmetric pipeline parallelism is accommodated: DP groups may possess different PP depths, prompting layer-wise gradient synchronization via multiple ring AllReduce operations.
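The TP enumeration rule above can be sketched in a few lines: a TP width t is admissible only if every node's GPU count is an integer multiple of t, so TP groups never span nodes and all DP groups share one TP width. The function name and node sizes are illustrative.

```python
# Sketch of valid TP-dimension enumeration: each node's GPU count must be
# an integer multiple of the candidate TP group size. Illustrative only.
def valid_tp_dims(gpus_per_node):
    """Return all TP widths that evenly divide every node's GPU count."""
    max_t = min(gpus_per_node)
    return [t for t in range(1, max_t + 1)
            if all(n % t == 0 for n in gpus_per_node)]

print(valid_tp_dims([8, 8, 4]))  # -> [1, 2, 4]
```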
Workload splitting is governed by the pipeline bubble ratio (derived from profiled stage times and micro-batch count with a 1F1B schedule). PP stage mapping assigns fewer consecutive layers to slower GPUs, ensuring memory-heavy and communication-light early PP stages run on less performant devices. NVLink channel priority is reserved first for TP traffic, then DP, then PP, optimizing performance across the interconnect topology.
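The bubble ratio driving this splitting can be approximated with a standard first-order 1F1B model: iteration time ≈ (m − 1) · max(stage times) + Σ(stage times) for m micro-batches, with the bubble as the idle fraction relative to keeping the bottleneck stage busy. This is a common approximation, not AutoHet's exact cost model.

```python
# First-order 1F1B bubble-ratio estimate; an approximation for intuition,
# not AutoHet's exact profiled cost model.
def bubble_ratio(stage_times, m):
    """Idle fraction of a 1F1B pipeline with m micro-batches."""
    iter_time = (m - 1) * max(stage_times) + sum(stage_times)
    ideal = m * max(stage_times)  # bottleneck stage busy the whole time
    return 1.0 - ideal / iter_time

# Homogeneous 4-stage pipeline, 8 micro-batches: bubble = 3/11.
print(round(bubble_ratio([1.0, 1.0, 1.0, 1.0], m=8), 3))  # -> 0.273
```

For p equal stages this reduces to the familiar (p − 1)/(m + p − 1), so the bubble shrinks as the micro-batch count grows.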
5. Elastic Recovery and Checkpointing for Spot Instances
AutoHet enables rapid, locality-aware recovery following spot instance preemption via layered checkpointing and adaptive shard loading:
- Layer-wise checkpointing saves model parameters and optimizer states keyed by layer and shard rather than by device.
- A layer bitmap tracks which shards reside locally versus in the cloud.
- Recovery follows three patterns:
- TP dimension unchanged: surviving GPUs load their local shards.
- TP dimension increased: new larger TP shards are split from original local full-shards.
- TP dimension decreased: multiple previous shards are concatenated for smaller TP groups.
During recovery, shards are assembled from local SSDs or via RDMA (400 Gb/s), only falling back to cloud storage (1.2 GB/s) for missing shards. This locality-driven protocol yields up to 4.38× faster recovery compared to full cloud restore baselines.
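The three recovery patterns share one underlying operation: concatenate a layer's saved shards back into the full tensor, then re-split to the new TP width. A minimal sketch, modeling shards as lists of rows (real checkpoints hold framework tensors; `reshard` is a hypothetical helper, not AutoHet's API):

```python
# Sketch of adaptive shard assembly on TP-dimension change; shards are
# modeled as lists of rows. "reshard" is an illustrative helper name.
def reshard(shards, new_tp):
    """Re-split a layer's TP shards into new_tp equal pieces (axis 0)."""
    full = [row for shard in shards for row in shard]  # reassemble the layer
    assert len(full) % new_tp == 0, "new TP width must divide the tensor dim"
    k = len(full) // new_tp
    return [full[i * k:(i + 1) * k] for i in range(new_tp)]

old = reshard([list(range(8))], 2)   # full layer -> TP=2 shards
grown = reshard(old, 4)              # TP increased: split local shards finer
shrunk = reshard(grown, 2)           # TP decreased: concatenate shard pairs
print(grown)   # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
print(shrunk)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the full tensor is rebuilt from whichever shards are nearest (local SSD, then RDMA peers, then cloud), the same split/concatenate logic serves all three TP-change cases.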
6. Empirical Performance Evaluation
On clusters mixing NVIDIA A100 (80 GB), H800 (80 GB), and H20 (100 GB) GPUs (32 total, grouped across four nodes), AutoHet was evaluated using BERT-Large (340M), GPT-3 (6.7B), and LLaMA (6.7–70B) models.
- Uniform GPU distribution:
- BERT-Large: 1.38× higher throughput than Megatron-LM.
- GPT-3: 1.53×/1.27× faster than Megatron-LM/Whale.
- Non-uniform GPU distribution:
- LLaMA 6.7B: up to 1.79×/1.51× faster than Megatron-LM/Whale for A100 + H800 mixes.
- A100 + H20 mixes yield 1.44×/1.16× improvements.
- Breakdown (GPT-3, 4 A100 + 4 H800):
- Device grouping: 1.11×.
- Node/stage mapping: 1.16×.
- PP workload balancing: 1.79× overall.
SCIP solves the device-grouping programs in 1.23–159 s for up to 64 GPUs. Profiling over powers-of-two layer blocks finishes in 11.9–15.4 minutes, roughly ten times faster than Alpa's approach.
Recovery performance:
- Full DP group reclaimed: 4.38× faster than cloud baseline (Varuna).
- Partial DP reclaimed: 1.49× speedup.
- New GPUs added: 3.59× speedup via full RDMA-only restore.
7. Design Guidelines and Implications
AutoHet’s empirical and theoretical results support the following design implications for large-scale LLM training on heterogeneous, elastic clusters:
- Always enforce symmetric TP across DP groups to minimize gradient reshaping overhead.
- Permit asymmetric PP by assigning fewer layers to slower GPUs and synchronizing gradients layer-wise.
- Prioritize NVLink routing—TP, then DP, then PP—to maximize utilization of high-speed interconnects.
- Decompose planning into global (DP grouping) and local (PP balancing) optimization problems solvable by standard tools.
- Accelerate profiling by composing powers-of-two layer probes linearly, minimizing turnaround.
- Checkpoint at layer granularity and use a layer bitmap for locality-driven recovery preference (local SSD and RDMA over cloud).
- Maintain lightweight elasticity: re-plan training in seconds, recover in seconds to minutes, preserving time savings from spot-instance usage.
The system demonstrates that formal, automated asymmetric 3D parallelism planning—coupled with locality-centric checkpointing—enables practitioners to leverage heterogeneous spot-instance GPU clusters with minimal intervention and with near-optimal efficiency (Wang et al., 24 Dec 2025).