Revisiting Parameter Server in LLM Post-Training
Abstract: Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in LLM post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at https://github.com/sail-sg/odc.
Explain it Like I'm 14
What is this paper about?
This paper looks at how to speed up training LLMs after they’ve already been built (this stage is called “post‑training,” like fine‑tuning and reinforcement learning). The authors noticed that a popular way of sharing work between many computers (called “collective communication”) breaks down when different computers have very different amounts of work to do. So they revisit an older idea (“parameter servers”) and design a new way to communicate called On‑Demand Communication (ODC) that makes training faster and uses the hardware better.
What problem are they trying to solve?
In simple terms: imagine a group project where each teammate has a different number of pages to read. If the rule is “we all must finish each page together before moving on,” then faster readers spend a lot of time waiting. That’s what happens in current LLM training when texts have very different lengths: longer texts take much more time, and shorter ones finish early and sit idle.
The paper asks these key questions:
- How can we stop fast computers from waiting on slow ones during training?
- Can we change the way computers share model pieces so each moves at its own pace?
- Will this make LLM post‑training faster without breaking how training normally works?
How did they study it? (Methods in everyday language)
Think of a giant model split into pieces across many GPUs (graphics cards). Today’s standard method, called FSDP (Fully Sharded Data Parallel), does two big group actions at almost every layer of the model:
- “All‑gather”: before each forward or backward pass through a layer, every GPU collects that layer’s full set of weights from the shards held by its peers.
- “Reduce‑scatter”: after computing gradients for a layer, the GPUs sum them together and then split the result so each GPU keeps only its own shard.
These group actions act like stoplights at every layer: nobody can move on until everyone is ready. That’s fine when all GPUs have equal work. But with real texts (some short, some very long), GPUs get out of sync and waste time waiting.
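These two collectives can be made concrete with a toy single-process simulation, where plain Python lists stand in for GPU shards (no real NCCL or FSDP calls involved):

```python
# Toy single-process simulation of FSDP's per-layer collectives.
# Plain Python lists stand in for GPU shards; no real distributed calls.

def all_gather(shards):
    """Each device reconstructs the full layer weights from every shard."""
    full = [w for shard in shards for w in shard]
    return [list(full) for _ in shards]   # every device gets a full copy

def reduce_scatter(per_device_grads, num_devices):
    """Sum full-layer gradients elementwise, then shard the result."""
    summed = [sum(vals) for vals in zip(*per_device_grads)]
    shard_size = len(summed) // num_devices
    return [summed[i * shard_size:(i + 1) * shard_size]
            for i in range(num_devices)]

# Two devices, each owning half of a 4-weight layer.
shards = [[1.0, 2.0], [3.0, 4.0]]
gathered = all_gather(shards)            # both now hold [1, 2, 3, 4]
grads = [[1.0] * 4, [3.0] * 4]           # each device's full-layer gradients
grad_shards = reduce_scatter(grads, 2)   # each keeps only its summed shard
```

Note that both functions need input from every device before any device gets a result back, which is exactly the per-layer stoplight described above.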
The authors propose ODC, which changes “group actions” into “on‑demand, direct exchanges”:
- Instead of all‑gathering at each layer, a GPU fetches only the weight pieces it needs directly from other GPUs when it’s ready.
- Instead of reduce‑scatter at each layer, a GPU sends its gradient pieces straight to the GPU that owns them, and those receivers add them up in the background.
They keep everything else the same (same memory setup, same math for training), but they move the “everyone must wait” point from “every layer” to “the end of the minibatch.” This lets faster GPUs keep going without being stalled by slower ones.
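The same two steps in ODC style can be sketched as follows; the class and method names here are illustrative only, not the open-sourced library’s actual API:

```python
# Toy sketch of ODC-style point-to-point primitives. Class and method
# names are illustrative only, not the open-source library's API.

class Device:
    def __init__(self, shard):
        self.shard = list(shard)                # parameter shard it owns
        self.grad_accum = [0.0] * len(shard)    # accumulates pushed grads

    def gather_from(self, peers):
        """Pull every owner's shard directly, whenever this device is ready."""
        return [w for peer in peers for w in peer.shard]

    def push_grad(self, owner, grad_shard):
        """Send a gradient shard straight to its owner, which adds it up."""
        for i, g in enumerate(grad_shard):
            owner.grad_accum[i] += g

devices = [Device([1.0, 2.0]), Device([3.0, 4.0])]

# Device 0 reaches this layer first: it fetches weights and pushes
# gradients immediately, without waiting for device 1 to catch up.
full = devices[0].gather_from(devices)        # [1.0, 2.0, 3.0, 4.0]
devices[0].push_grad(devices[0], [0.5, 0.5])  # gradient for its own shard
devices[0].push_grad(devices[1], [1.5, 1.5])  # gradient for device 1's shard
```

The only barrier left is the end of the minibatch, when each owner applies an optimizer update to the gradients it has accumulated from all peers.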
To test this idea, they:
- Integrated ODC into FSDP so it can be used in today’s training setups.
- Ran experiments on real LLM post‑training tasks: supervised fine‑tuning (SFT) with long texts and reinforcement learning (RL) for math reasoning.
- Measured how many samples per second they could process and how busy the GPUs stayed.
- Compared ODC to the standard collective method and different ways of “packing” sequences to balance work.
What did they find, and why does it matter?
Here are the main results in simple terms:
- ODC makes training faster on long‑text fine‑tuning tasks, with up to 36% speedup over the standard method. That means more examples processed in the same time and less waiting.
- In reinforcement learning training, ODC also speeds things up—by up to about 10%—even though those tasks had less extreme differences in text length.
- ODC reduces idle time because GPUs don’t have to stop at every layer; they only synchronize at the end of each minibatch.
- ODC simplifies balancing work: instead of forcing every GPU to handle the same number of small chunks (microbatches), you can balance total work across the whole minibatch and let GPUs pack their local data however fits their memory. This is easier and often more effective.
Why it’s important:
- Real‑world text varies a lot in length, and longer texts take much more compute (attention cost grows roughly with the square of length). ODC matches this reality better than the old “everyone moves together” approach.
- Faster training means lower costs and shorter turnaround for improving models.
- The code is open‑sourced, so others can try it and build on it.
What are the broader takeaways and future impact?
In simple terms:
- The old “group‑move” style is great when everyone has equal work, but real LLM post‑training isn’t like that. ODC brings back the good parts of the parameter server idea (letting workers move at their own pace) and blends them with modern, memory‑efficient training.
- ODC works especially well for tasks with long, uneven sequences, which are common in today’s LLM fine‑tuning and RL pipelines.
- There are trade‑offs: direct point‑to‑point exchanges can be slower across different machines than highly optimized group operations. But in many long‑sequence cases, the extra computation hides this cost, and there are practical fixes (like keeping model shards within the same machine).
- Looking ahead, ODC could be extended to support even more flexibility (like slightly asynchronous updates) and better resilience (handling machine failures or changing cluster sizes), making large‑scale training more robust.
Overall, the paper shows a simple idea with big impact: let each GPU pull and push what it needs when it’s ready, instead of forcing everyone to stop and go together at every layer. This change fits real data better and makes LLM post‑training faster and more efficient.
Knowledge Gaps
Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions for future research.
- Quantify and mitigate inter-node inefficiency: ODC’s point-to-point RDMA lags NCCL collectives across nodes; design and evaluate hierarchical P2P overlays, cache-aware routing, and topology-aware scheduling to regain cross-node efficiency.
- Predictive performance models: Develop analytical and empirical models that predict when ODC outperforms collectives as a function of sequence-length distribution, packing ratio, minibatch size, device count, and interconnect bandwidth; use these to drive auto-tuning.
- Generalization to diverse hardware and topologies: Validate ODC on different GPUs (e.g., H100/B200), interconnects (InfiniBand/Ethernet/PCIe-only), and heterogeneous or multi-tenant clusters; characterize NIC saturation, CPU involvement, and energy/cost trade-offs.
- Formal convergence and numerical equivalence: Provide rigorous evidence that synchronous minibatch-boundary updates in ODC are numerically equivalent to FSDP across optimizers, mixed precision, gradient scaling, and weight decay; quantify any nondeterminism due to reordering of gradient arrivals.
- Asynchronous or bounded-staleness variants: Design ODC variants with relaxed synchronization (e.g., stale synchronous parallel) and analyze convergence and stability for LLM post-training under realistic heterogeneity.
- Fault tolerance and elasticity: Add PS-style resilience (node join/leave, resharding, recovery after failures) to ODC; study consistency guarantees and throughput impacts of elastic resizing compared to collective-based systems.
- Communication congestion control: Develop principled RDMA flow-control for ODC (payload sizing, backpressure, QoS), and formally validate daemon-based gradient accumulation for correctness under concurrent pushes and heavy contention.
- Memory overhead and buffer management: Quantify per-server buffer memory for scatter-accumulate at scale (very large models, many clients), and design adaptive buffering or compression to avoid OOM and minimize footprint.
- Security and isolation in RDMA: Evaluate safety of remote GPU memory access in shared clusters; propose access control, sandboxing, and isolation mechanisms compatible with ODC’s on-demand transfers.
- Parameter tying and cross-layer dependencies: Specify and test how ODC handles shared weights (e.g., tied embeddings) and cross-layer parameter dependencies without reintroducing synchronization bottlenecks.
- Gradient compression/quantization: Explore compatibility of ODC with gradient compression (e.g., 8-bit, sparsification) or error-feedback to reduce bandwidth while preserving accuracy.
- Dynamic fallback and hybridization: Create policies that switch between ODC and collectives at runtime based on workload balance and sequence lengths; improve hybrid sharding selection beyond the current heuristic.
- Load balancing optimality and guarantees: Provide theoretical analysis for LB-Mini (minibatch-level balancing) under memory constraints, approximation guarantees vs. optimal packing, and fairness criteria across devices.
- Integration with 3D/4D parallelism: Study ODC with pipeline, tensor, and expert (MoE) parallelism; ensure shard ownership, routing, and scheduling work coherently without reintroducing fine-grained barriers.
- Scalability beyond 32 GPUs: Demonstrate ODC at larger scales (hundreds to thousands of GPUs), quantify straggler mitigation vs. cross-node overhead, and identify scaling limits.
- End-to-end training quality: Move beyond throughput to evaluate time-to-target quality, final accuracy, stability across seeds, and sample efficiency in both SFT and RL; include long training runs and diverse optimizers.
- RL pipeline evaluation: Measure end-to-end RL (including actor rollout), remove the constraint of equal samples per device in verl, and assess effects on policy quality and training throughput.
- Scheduling and fairness: Design straggler-aware policies for minibatch-end synchronization (timeouts, missing gradients, retries) that balance throughput with correctness; quantify impacts on update latency and fairness.
- Framework portability: Provide production-ready integration paths for PyTorch/DeepSpeed/JAX, with debugging and profiling tools, and document deployment prerequisites (NVSHMEM availability, RDMA config).
- Theoretical runtime bounds: Extend the paper’s per-layer bound to minibatch-level ODC, derive expected speedups under heterogeneous workloads, and relate them to observed bubble rates.
- Caching strategies for parameter shards: Investigate cross-layer/microbatch caching of fetched shards, coherence protocols, eviction policies, and memory trade-offs.
- Multi-tenant and congested network scenarios: Evaluate ODC under realistic datacenter congestion, mixed workloads, and background traffic; propose isolation or scheduling to maintain performance.
- Comparative baselines: Benchmark against advanced collective optimizations (e.g., ZeRO++, elastic collectives, hierarchical all-reduce) and ablate overlap, hybrid sharding, and packing contributions to isolate ODC’s net benefit.
- Robust OOM modeling: Improve memory-feasibility checks beyond sequence length (include activation checkpointing, attention variants, optimizer states) to avoid runtime OOM in packing decisions.
- Applicability beyond post-training: Test ODC in pretraining and other domains (vision, speech) to understand when imbalance tolerance still yields gains and when collectives remain preferable.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now, derived from ODC’s decentralized-PS adaptation of FSDP and its minibatch-level load balancing.
- Sector: Software/AI Infrastructure — Drop‑in acceleration for long‑context LLM post‑training
- What: Replace per‑layer collectives in PyTorch FSDP with ODC’s point‑to‑point gather and scatter‑accumulate to reduce synchronization barriers, especially under variable sequence lengths.
- Why: Up to 36% throughput gains in supervised fine‑tuning (SFT) and ~10% in RL tasks by mitigating straggler effects and reducing device idle time.
- Tools/Workflows: Open-source ODC library (https://github.com/sail-sg/odc), PyTorch FSDP integration, Triton‑Distributed for RDMA kernels, CUDA IPC (intra‑node), NVSHMEM (inter‑node).
- Assumptions/Dependencies: Best for imbalanced workloads (long/variable sequences); requires RDMA/NVSHMEM availability; benefits are largest intra‑node or when computation dominates communication.
- Sector: MLOps/Cloud — Cost and energy reduction through utilization gains
- What: Integrate ODC into training templates to reduce GPU idle time (bubble rate) on long-context jobs, lowering GPU-hour spend and energy use.
- Why: Device decoupling reduces synchronization stalls; fewer wasted GPU cycles translates to lower cost and carbon footprint.
- Tools/Workflows: “ODC‑optimized” training recipes for Torch Distributed, monitoring dashboards that surface bubble rate and throughput before/after ODC.
- Assumptions/Dependencies: Cluster must support CUDA IPC and preferably NVSHMEM; teams must adopt basic telemetry to quantify gains.
- Sector: Software Engineering (Agents), Education, Legal, Finance, Healthcare — Faster long‑context SFT
- What: Use ODC for SFT on domains with long documents (code repos, textbooks, contracts, filings, clinical notes).
- Why: Long sequences create severe workload skew; ODC’s minibatch‑level sync avoids per-layer barriers, speeding training of long-context models/agents.
- Tools/Workflows: LongAlign-style pipelines; advanced sequence packing; token‑weighted minibatching; integration with modern attention kernels (e.g., FlashAttention).
- Assumptions/Dependencies: Gains scale with sequence length variance; requires memory‑aware packing to stay within device limits.
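A minimal first-fit-decreasing packing sketch under a fixed token budget per microbatch (the budget corresponds to packing ratio times max sequence length; the function name is hypothetical, and real pipelines add memory-aware checks):

```python
# Minimal first-fit-decreasing sequence packing under a token budget per
# microbatch. A simplified stand-in for real packing pipelines, which
# also account for activation memory and attention masks.

def pack_sequences(lengths, token_budget):
    """Greedily pack sequence lengths into microbatches of <= token_budget."""
    bins = []  # each bin: [remaining_budget, [packed lengths]]
    for s in sorted(lengths, reverse=True):
        assert s <= token_budget, "sequence longer than the budget"
        for b in bins:
            if b[0] >= s:            # first microbatch with room
                b[0] -= s
                b[1].append(s)
                break
        else:                        # no room anywhere: open a new one
            bins.append([token_budget - s, [s]])
    return [b[1] for b in bins]

# Budget = packing_ratio * max_seq_len, e.g. 2 * 4096 = 8192 tokens.
microbatches = pack_sequences([4096, 3000, 2000, 1000, 500], 8192)
```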
- Sector: RL for Reasoning/Agents — Higher-throughput policy training
- What: Apply ODC to GRPO/PPO-like training loops where prompt/trajectory lengths vary (e.g., math reasoning, coding agents).
- Why: Decoupled device progress alleviates microbatch-level variability; demonstrated speedups without changing RL semantics.
- Tools/Workflows: Integration into verl and similar RL frameworks; optimized two‑level partitioning (minibatch-first) for load balancing.
- Assumptions/Dependencies: Some RL stacks assume equal microbatch counts per device—lifting that constraint (as ODC enables) yields larger gains.
- Sector: Distributed Systems/Academia — Simplified load balancing via LB‑Mini
- What: Adopt minibatch‑level balancing (LB‑Mini) that assigns different numbers of microbatches per device based on compute cost (e.g., Karmarkar‑Karp), then pack locally under memory constraints.
- Why: Coarser balancing is simpler and more effective when microbatch packing space is tight; removes layer-by-layer coupling.
- Tools/Workflows: Implement compute-cost partitioning in the data loader; per‑device local packing; correctness maintained with minibatch-level sync.
- Assumptions/Dependencies: Framework must permit per‑device variability in microbatch counts; best with token-to-compute skew (O(s²)).
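The compute-cost partitioning step can be illustrated with the classic Karmarkar–Karp differencing method for the two-device case (a deliberate simplification: the paper’s LB-Mini balances across many devices and folds in memory constraints):

```python
import heapq

# Karmarkar-Karp largest-differencing method, two-device case only.
# A simplified illustration of minibatch-level compute balancing.

def kk_two_way(costs):
    """Split per-microbatch compute costs into two near-equal groups."""
    # Heap entries: (-imbalance, group_a, group_b), meaning that placing
    # group_a on one device and group_b on the other yields that imbalance.
    heap = [(-c, [c], []) for c in costs]
    heapq.heapify(heap)
    while len(heap) > 1:
        d1, a1, b1 = heapq.heappop(heap)   # largest imbalance
        d2, a2, b2 = heapq.heappop(heap)   # second largest
        # Oppose them: the heavy side of one absorbs the light side of
        # the other, so the combined imbalance is the difference.
        heapq.heappush(heap, (d1 - d2, a1 + b2, b1 + a2))
    _, group0, group1 = heap[0]
    return group0, group1

# Per-microbatch compute costs (e.g., proportional to summed s**2).
group0, group1 = kk_two_way([8, 7, 6, 5, 4])
```

Differencing defers the actual device assignment: it only decides which microbatches must land on opposite devices, which is why it tends to beat naive greedy placement.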
- Sector: Multi‑node Training — Hybrid sharding to mitigate inter‑node overhead
- What: Shard parameters/gradients within nodes while sharding optimizer states across nodes to avoid cross‑node parameter gathers and gradient pushes.
- Why: ODC’s point‑to‑point is less bandwidth‑efficient than NCCL collectives across nodes; hybrid sharding cuts inter‑node traffic with modest memory tradeoffs.
- Tools/Workflows: ZeRO++‑style hybrid sharding; memory budgeting per node to absorb larger intra‑node shards.
- Assumptions/Dependencies: Requires enough per‑node memory headroom; network topology awareness helpful.
- Sector: Cluster Ops — Robustness to stragglers and heterogeneity
- What: Run ODC on shared or mixed‑hardware clusters (e.g., varied GPU models or background noise) to prevent fast workers from stalling.
- Why: Minibatch‑level sync tolerates queueing and minor performance asymmetries; better throughput without perfect homogeneity.
- Tools/Workflows: Scheduler policies that allow heterogeneous allocation; ODC‑aware job configs.
- Assumptions/Dependencies: Gains are largest when heterogeneity or load imbalance is meaningful; convergence remains synchronous at minibatch boundary.
- Sector: Education/Training — Teaching and benchmarking modern DP systems
- What: Use ODC to illustrate tradeoffs between collectives and parameter‑server‑style schemes in coursework and labs; benchmark bubble rates and throughput under imbalance.
- Why: Realistic training dynamics (variable lengths, microbatches) are increasingly central to LLM curricula and systems research.
- Tools/Workflows: ODC repo and example notebooks; bubble rate instrumentation; parametric studies (batch size, length, packing ratio).
- Assumptions/Dependencies: Access to a multi‑GPU node for in‑class demos; optional multi‑node for advanced labs.
Long-Term Applications
Below are opportunities that require further research, scaling, or development before broad deployment.
- Sector: AI Infrastructure — Asynchronous/Bounded‑staleness ODC for further utilization gains
- What: Extend ODC to bounded‑staleness or fully async SGD to relax even the minibatch‑level barrier.
- Why: Further reduces idle time in highly heterogeneous or unstable environments (e.g., preemptible instances).
- Tools/Workflows: ODC variants with staleness control; convergence analysis and safeguards.
- Assumptions/Dependencies: Requires theoretical/empirical convergence validation for LLM post‑training; careful optimizer design.
- Sector: Communications/Systems — Topology‑aware and hierarchical ODC
- What: Add node‑local caching and hierarchical paths (e.g., fetch from a peer on the same node; aggregate pushes intra‑node before inter‑node) to recover NCCL‑like multi‑node efficiency.
- Why: Current ODC is bandwidth‑limited across nodes; topology‑aware routing can close the gap.
- Tools/Workflows: Enhanced ODC runtime with cache coherence; integration with NCCL for intra‑node and ODC for inter‑node; Triton compiler support.
- Assumptions/Dependencies: Nontrivial engineering; depends on NIC/NVSwitch capabilities and SHMEM semantics.
- Sector: Frameworks — Dynamic hybrid switching between collectives and ODC
- What: Runtime autotuner that selects ODC or collectives per phase, layer group, or minibatch based on measured imbalance and token lengths.
- Why: Achieve close‑to‑optimal performance across a wide range of workloads and cluster topologies.
- Tools/Workflows: PyTorch plugins, cost models, and online profiling to switch modes; integration with schedulers.
- Assumptions/Dependencies: Low‑overhead decision logic; robust fallbacks; requires standardized performance counters.
- Sector: Multimodal/Robotics — Extending imbalance‑tolerant training beyond text
- What: Apply ODC to domains with variable compute per sample (e.g., video frames, audio durations, RL rollouts, robot trajectories).
- Why: Similar imbalance patterns (variable sequence lengths or compute graphs) benefit from device progress decoupling.
- Tools/Workflows: ODC integration into multimodal frameworks and RL libraries; domain‑specific packing.
- Assumptions/Dependencies: Must validate memory/compute scaling properties and convergence in non‑text domains.
- Sector: Energy/Cloud Scheduling — Energy‑aware, imbalance‑tolerant schedulers
- What: Scheduler policies that steer long‑context jobs to ODC‑enabled nodes and co‑locate jobs to maximize compute/communication overlap.
- Why: Aligns workload characteristics with hardware/network strengths to minimize energy per sample trained.
- Tools/Workflows: Slurm/Kubernetes plugins; cluster‑level cost models that account for ODC behavior.
- Assumptions/Dependencies: Requires coordination between training stack and cluster scheduler; accurate job telemetry.
- Sector: Reliability — Elasticity and fault tolerance for LLM post‑training
- What: Incorporate PS‑style elasticity into ODC (join/leave nodes mid‑training, automatic recovery from failures).
- Why: Collectives are brittle to faults/resizes; PS architectures naturally permit elastic scaling and fault recovery.
- Tools/Workflows: Checkpointing compatible with sharded states; membership management; consistent optimizer state updates.
- Assumptions/Dependencies: Protocols for state continuity across elastic events; potential throughput tradeoffs.
- Sector: Standards/Policy — Best‑practice guidance for public compute and sustainability
- What: Draft guidelines that encourage imbalance‑aware training (ODC or equivalent) for long‑context workloads in public‑funded clusters and report “bubble rate” as an efficiency metric.
- Why: Reduces energy waste and costs; supports responsible compute usage mandates.
- Tools/Workflows: Standardized metrics and benchmarking suites; procurement language that requires imbalance‑tolerant solutions.
- Assumptions/Dependencies: Community and vendor buy‑in; updates to existing MLPerf‑like benchmarks.
- Sector: Hardware Co‑design — NIC/GPU features for fine‑grained RDMA between GPUs
- What: Hardware and firmware support for efficient, secure point‑to‑point GPU RDMA and server‑side gradient accumulation at scale.
- Why: Close ODC’s inter‑node performance gap; enable on‑device daemons and low‑overhead notification paths.
- Tools/Workflows: Vendor driver updates; NVLink/NVSwitch/NIC roadmap alignment; SHMEM extensions.
- Assumptions/Dependencies: Multi‑year hardware cycles; security isolation for cross‑process/device memory access.
- Sector: Productization — Turnkey “long‑context fine‑tuning” and “agent‑training” kits
- What: Commercial offerings that bundle ODC‑enabled training stacks, packing algorithms, and best‑practice configs for long‑context SFT and RLHF.
- Why: Reduce operational burden; accelerate adoption by enterprises with long documents/codebases.
- Tools/Workflows: Prebuilt Docker images, Helm charts, and reference pipelines with monitoring and autoscaling.
- Assumptions/Dependencies: Customer clusters must expose RDMA/NVSHMEM; support agreements for low‑level runtime.
Notes on feasibility across all applications:
- ODC is most beneficial when compute scales super‑linearly with sequence length (e.g., attention O(s²)) and when microbatch memory constraints force imbalanced packing.
- Inter‑node performance depends on overlapping communication with heavy compute or adopting hybrid sharding to limit cross‑node traffic.
- Some frameworks (especially RL) currently assume uniform microbatch counts per device; unlocking ODC’s full benefit may require relaxing such constraints.
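The first point above can be made concrete with a back-of-envelope calculation (an attention-only cost model, which is a deliberate simplification):

```python
# Back-of-envelope skew estimate: attention compute grows ~s**2 per
# sequence, so long sequences skew work far more than token counts do.
# Deliberately simplified; real kernels add linear and constant terms.

short_len, long_len = 1024, 8192
token_ratio = long_len / short_len            # 8x the tokens
compute_ratio = (long_len / short_len) ** 2   # 64x the attention compute
```

A device holding one 8192-token sequence carries roughly 64x the attention work of a device holding one 1024-token sequence, despite only 8x the tokens; this is why balancing token counts alone can still leave large compute imbalance.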
Glossary
- Activation memory: The memory required to store intermediate activations during forward/backward passes; in transformers it typically scales linearly with sequence length. "activation memory grows linearly"
- all-gather: A collective operation that gathers shards from all devices so each device gets the full tensor; used in FSDP to reconstruct parameters per layer. "AG = all-gather; RS = reduce-scatter."
- AllReduce: A collective communication pattern that aggregates and distributes values across devices, often used for gradient synchronization. "Baidu AllReduce"
- asynchronous SGD: A family of stochastic gradient descent methods where parameter updates do not wait for all workers, reducing synchronization but introducing staleness. "classic asynchronous SGD schemes (Recht et al., 2011)"
- bounded-staleness updates: Consistency schemes allowing a limited delay (staleness) between workers’ views of parameters to reduce synchronization overhead. "such as bounded-staleness updates (Chen et al., 2016; Ho et al., 2013)"
- collective communication: Multi-party communication primitives (e.g., all-gather, reduce-scatter) that require coordinated participation of all ranks. "Modern data parallel (DP) training favors collective communication over parameter servers (PS)"
- CUDA IPC: NVIDIA’s intra-node GPU-to-GPU interprocess communication mechanism enabling direct memory access across GPUs on the same host. "CUDA IPC (NVIDIA, a)"
- data parallel (DP): A distributed training strategy where model replicas process different data shards and synchronize updates. "Modern data parallel (DP) training favors collective communication"
- decentralized parameter server: A PS design where parameter shards and optimizer states are colocated across all worker devices, avoiding a central bottleneck. "reframes FSDP as a decentralized PS"
- elastic scalability: The capability of a distributed system to change the number of workers during training without stopping or reconfiguring extensively. "exploring elastic scalability with continuous fault tolerance"
- FSDP (Fully Sharded Data Parallel): A memory-efficient data-parallel scheme that shards parameters, gradients, and optimizer states across devices with per-layer collectives. "Fully Sharded Data Parallel (FSDP)"
- fault tolerance: The ability of a training system to continue operating correctly in the presence of node or network failures. "continuous fault tolerance"
- gather (ODC primitive): In ODC, a point-to-point operation where a device pulls the specific parameter shard it needs from peers. "An all-gather is replaced by a series of targeted gather requests"
- gradient accumulation: Summing per-microbatch gradients before applying an optimizer update to emulate larger effective batch sizes. "and accumulate gradients before performing the optimizer update."
- GRPO: A reinforcement learning algorithm (used in experiments) for optimizing LLM reasoning performance. "we run GRPO (Guo et al., 2025; Liu et al., 2025)"
- hierarchical interconnects: Multi-level network structures (e.g., intra-node and inter-node) exploited to optimize communication efficiency. "exploiting hierarchical interconnects in multi- node settings."
- Horovod: A distributed training framework implementing efficient collective operations (notably ring-allreduce) across deep learning platforms. "Horovod (Sergeev & Del Balso, 2018)"
- hybrid sharding: A strategy that shards parameters/gradients within a node and optimizer states across nodes to reduce cross-node traffic. "Hybrid Sharding. When the tokens per microbatch is too small to hide communication costs, hybrid sharding provides an effective solution."
- Karmarkar–Karp algorithm: A heuristic for the number partitioning problem used to balance computational workloads across devices. "We use the Karmarkar-Karp algorithm (Karmarkar & Karp, 1982)"
- microbatch: A subdivision of a minibatch processed in separate forward/backward passes to fit memory constraints. "within a microbatch."
- NCCL: NVIDIA’s Collective Communications Library providing optimized multi-GPU/multi-node collective operations. "NCCL (NVIDIA, b)"
- NVSHMEM: NVIDIA’s OpenSHMEM implementation enabling one-sided, RDMA-based GPU memory operations across nodes. "NVSHMEM (NVIDIA, c)"
- NVSwitch: NVIDIA’s high-bandwidth intra-node switch interconnect used to connect multiple GPUs within a server. "with NVSwitch for intra-node communication"
- ODC (On-Demand Communication): The proposed scheme replacing collectives with point-to-point operations to relax synchronization to minibatch boundaries. "We propose On-Demand Communication (ODC)"
- packing ratio: A parameter controlling how many tokens are allowed per microbatch relative to the maximum sequence length. "Packing ratio: the maximum number of tokens allowed in a microbatch divided by the max sequence length"
- parameter server (PS): A distributed training architecture where servers store parameters and workers compute and push/pull updates. "the PS architecture"
- point-to-point communication: Direct communication between two devices (ranks) without coordinated participation of all workers. "direct point-to-point communication"
- RDMA: Remote Direct Memory Access, enabling direct memory reads/writes across nodes without CPU involvement on the target side. "RDMA-based interfaces"
- reduce-scatter: A collective that reduces (e.g., sums) data across devices and scatters the result so each device retains a shard. "AG = all-gather; RS = reduce-scatter."
- ring-based methods: Collective algorithms organizing devices in a ring to achieve bandwidth-efficient, scalable communication. "Ring-based methods, as demonstrated in Baidu AllReduce (Research, 2017) and Horovod (Sergeev & Del Balso, 2018), reduced bandwidth requirements while scaling predictably."
- RoCE: RDMA over Converged Ethernet, providing RDMA capabilities on Ethernet networks. "RoCE RDMA (800 Gbps per node)"
- scatter-accumulate: In ODC, a point-to-point operation where workers push gradients to the owning device, which accumulates them. "scatter-accumulate operations"
- sequence packing: Concatenating multiple sequences with attention masks to reduce padding and balance compute across batches. "the strategy of sequence packing"
- sharding: Partitioning tensors (parameters/gradients/optimizer states) across devices to reduce per-device memory usage. "By sharding parameters, gradients, and optimizer states across devices"
- straggler effects: Slow workers delaying overall progress in synchronized systems, causing idle time on faster devices. "mitigates straggler effects"
- synchronization barriers: Points where all devices must wait and align before proceeding, often induced by collectives. "These per-layer collectives create fundamental synchronization barriers"
- topology-aware collectives: Collective algorithms that exploit hardware/network topology to optimize communication paths. "similar to topology-aware collectives."
- Triton: A GPU programming language and compiler for writing high-performance kernels for deep learning. "a Triton (Tillet et al., 2019) wrapper"
- Triton-Distributed: A framework exposing RDMA functionality within Triton kernels to program overlapping distributed operations. "Triton-Distributed (Zheng et al., 2025)"
- ZeRO: A family of memory-optimization techniques that shard model states to enable training very large models. "exemplified by ZeRO (Rajbhandari et al., 2020)"
- ZeRO++: An extension improving collective communication efficiency for large-model training. "Similar to ZeRO++ (Wang et al., 2024)"