OLDSGD: Overlapping Local Decentralized SGD
- OLDSGD is a distributed optimization method that overlaps communication and computation by exchanging stale model states and performing scheduled combine-then-adapt updates.
- It achieves improved wall-clock efficiency and maintains convergence rates equivalent to centralized SGD in smooth, nonconvex settings.
- Empirical results show near-linear scalability and significant training speedup compared to traditional decentralized and synchronous methods.
Overlapping Local Decentralized SGD (OLDSGD) is a distributed optimization method designed to maximize communication-computation overlap in decentralized stochastic gradient descent. OLDSGD achieves improved wall-clock efficiency and preserves the theoretical convergence rate of standard Local Decentralized SGD schemes, including in smooth nonconvex settings. It is applicable to a network of agents connected by an undirected graph, each holding a portion of the global objective; OLDSGD operates by exchanging stale model states and performing a “combine-then-adapt” update at scheduled consensus rounds. The method maintains network-wide average consistency with centralized SGD, ensuring no staleness-induced bias. Empirical results demonstrate substantial improvements in training speed relative to prior decentralized algorithms, with minimal implementation overhead (Wang et al., 2020, Zhou et al., 4 Jan 2026).
1. Algorithmic Framework
OLDSGD considers the problem

$$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$

where each agent $i$ holds a local objective $f_i$, and communicates within an undirected topology represented by a symmetric, doubly-stochastic mixing matrix $W = [w_{ij}]$.
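As a concrete instance of this setup, a uniform mixing matrix for an undirected ring can be built and checked as follows. This is a minimal sketch of our own; the $1/3$ weight on self and each neighbor is a common choice assumed here, not taken from the paper.

```python
# Sketch: uniform mixing matrix for an undirected ring of n agents.
# The 1/3 weight on self and each neighbor is an illustrative choice.
def ring_mixing_matrix(n):
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W

def is_doubly_stochastic(W, tol=1e-12):
    n = len(W)
    rows = all(abs(sum(row) - 1.0) < tol for row in W)
    cols = all(abs(sum(W[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    return rows and cols

W = ring_mixing_matrix(8)
assert is_doubly_stochastic(W)    # rows and columns each sum to 1
assert all(W[i][j] == W[j][i] for i in range(8) for j in range(8))  # symmetric
```

Symmetry plus double stochasticity are exactly the conditions the convergence analysis in Section 2 requires of $W$.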
Update Mechanism:
- Each agent $i$ performs $\tau$ local SGD steps between consensus rounds.
- During non-consensus iterations ($(t+1) \bmod \tau \neq 0$), agent $i$ updates:

  $$x_i^{t+1} = x_i^{t} - \gamma\, g_i^{t},$$

  where $g_i^{t} = \nabla F_i(x_i^{t};\xi_i^{t})$ is the stochastic gradient.
- At consensus rounds ($(t+1) \bmod \tau = 0$), each agent executes the combine-then-adapt update:

  $$x_i^{t+1} = \sum_{j=1}^{n} w_{ij}\, x_j^{t+1-\tau} \;-\; \gamma \sum_{s=t+1-\tau}^{t} g_i^{s}.$$

  Agents exchange the model states $x_j^{t+1-\tau}$ from the previous consensus round, overlapping communication with computation.
Average-Model Property:
The network-wide average model

$$\bar{x}^{t} = \frac{1}{n}\sum_{i=1}^{n} x_i^{t}$$

obeys the centralized SGD recursion

$$\bar{x}^{t+1} = \bar{x}^{t} - \frac{\gamma}{n}\sum_{i=1}^{n} g_i^{t}$$

for every $t$.
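The average-model property can be checked numerically. The sketch below (our construction, not the paper's code) runs the OLDSGD recursion on scalar quadratics $f_i(x) = (x - c_i)^2/2$ with deterministic gradients; for quadratics the average of the local gradients equals the centralized gradient at the average model, so the network average must track plain gradient descent exactly.

```python
# Sketch: OLDSGD on scalar quadratics f_i(x) = (x - c_i)^2 / 2, whose
# gradients x - c_i are deterministic (zero-variance). All names are ours.
def oldsgd_average(c, W, gamma=0.1, tau=4, rounds=5):
    n = len(c)
    x = [0.0] * n                     # current local models
    stale = list(x)                   # states exchanged at the last consensus round
    gbuf = [0.0] * n                  # gradients accumulated since last consensus
    avg = [sum(x) / n]
    for _ in range(rounds):
        for step in range(tau):
            g = [x[i] - c[i] for i in range(n)]
            for i in range(n):
                gbuf[i] += g[i]
            if step < tau - 1:        # plain local SGD step
                x = [x[i] - gamma * g[i] for i in range(n)]
            else:                     # combine-then-adapt with *stale* states
                x = [sum(W[i][j] * stale[j] for j in range(n)) - gamma * gbuf[i]
                     for i in range(n)]
                stale = list(x)       # these states are sent for the next round
                gbuf = [0.0] * n
            avg.append(sum(x) / n)
    return avg

# 4-agent uniform ring: weight 1/3 on self and each neighbor.
W = [[1 / 3 if (j - i) % 4 in (0, 1, 3) else 0.0 for j in range(4)] for i in range(4)]
c = [1.0, 3.0, -2.0, 2.0]
avg = oldsgd_average(c, W)

# Centralized SGD reference driven by the same averaged gradient.
ref, xc, cbar = [0.0], 0.0, sum(c) / 4
for _ in range(20):
    xc -= 0.1 * (xc - cbar)
    ref.append(xc)
assert all(abs(a - b) < 1e-9 for a, b in zip(avg, ref))
```

The local models themselves drift apart between consensus rounds, yet their average coincides with the centralized trajectory at every iteration, illustrating that staleness introduces no bias in the mean.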
2. Theoretical Properties and Convergence
OLDSGD’s theoretical analysis in (Zhou et al., 4 Jan 2026) establishes convergence rates under the following assumptions:
- The mixing matrix $W$ is symmetric and doubly-stochastic; the graph is connected.
- Each $f_i$ is $L$-smooth.
- Stochastic gradients are unbiased, with variance bounded by $\sigma^2$ and heterogeneity bounded by $\zeta^2$.
Main Result:
If the stepsize $\gamma$ is sufficiently small, the ergodic average gradient norm satisfies

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left\|\nabla f(\bar{x}^{t})\right\|^2 \;\le\; \mathcal{O}\!\left(\frac{f(\bar{x}^{0}) - f^{\star}}{\gamma T} + \frac{\gamma L \sigma^2}{n} + \frac{\gamma^2 L^2 \tau^2 (\sigma^2 + \zeta^2)}{(1-\lambda)^2}\right),$$

where $1-\lambda$ is the spectral gap of $W$ (with $\lambda$ being the second-largest eigenvalue magnitude of $W$); $f^{\star} = \inf_x f(x)$. Setting $\gamma = \Theta(\sqrt{n/T})$ yields

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left\|\nabla f(\bar{x}^{t})\right\|^2 = \mathcal{O}\!\left(\frac{1}{\sqrt{nT}}\right),$$

thus matching the rate of standard Local Decentralized SGD for smooth nonconvex objectives (Zhou et al., 4 Jan 2026).
This suggests that OLDSGD provides theoretical guarantees equivalent to non-overlapping decentralized variants, but with improved runtime characteristics due to reduced synchronization waiting.
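The dependence on the spectral gap $1-\lambda$ can be made concrete for the ring topology used in the experiments. For a uniform ring (assumed weight $1/3$ on self and each neighbor, a choice of ours), $W$ is circulant and its eigenvalues have the closed form $\lambda_k = (1 + 2\cos(2\pi k/n))/3$:

```python
import math

# Eigenvalues of the uniform ring mixing matrix (weight 1/3 on self and each
# neighbor, an illustrative choice) via the circulant-matrix formula.
def ring_spectral_gap(n):
    lams = [(1 + 2 * math.cos(2 * math.pi * k / n)) / 3 for k in range(n)]
    # Second-largest eigenvalue magnitude, excluding the eigenvalue 1 itself.
    lam2 = max(abs(l) for l in lams if abs(l - 1.0) > 1e-12)
    return 1.0 - lam2

# The gap shrinks as the ring grows, so the consensus error term
# ~ tau^2 / (1 - lambda)^2 grows and scaling benefits saturate (Section 7).
gaps = {n: ring_spectral_gap(n) for n in (4, 8, 16)}
assert gaps[4] > gaps[8] > gaps[16] > 0
```

This is why the empirical scaling in Section 4 is reported as near-linear only up to moderate node counts on rings.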
3. Communication-Overlap and Practical Efficiency
The primary innovation of OLDSGD is the strict separation of consensus communication and local computation:
- Stale model states ($x_i^{t+1-\tau}$) are sent as soon as available using non-blocking primitives (MPI_Isend/Irecv or torch.distributed.isend/irecv).
- Local computation proceeds unabated during communication, with model states arriving by the next consensus round.
- Consensus steps “combine” the received states and “adapt” with the accumulated local gradients (CTA).
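The overlap pattern above can be sketched in plain Python, with a background thread standing in for a non-blocking send/receive (MPI_Isend/Irecv or torch.distributed.isend/irecv in a real deployment). All names and delays here are illustrative:

```python
import threading, time, queue

# Sketch: simulate a non-blocking model exchange while local steps continue.
def async_exchange(state, inbox, delay=0.02):
    """Deliver a snapshot of `state` to `inbox` after a simulated link delay."""
    payload = list(state)                 # snapshot at send time (stale from now on)
    def _deliver():
        time.sleep(delay)                 # simulated network latency
        inbox.put(payload)
    t = threading.Thread(target=_deliver)
    t.start()
    return t                              # handle to wait on, like MPI_Wait

inbox = queue.Queue()
model = [1.0, 2.0]
handle = async_exchange(model, inbox)

# Local computation proceeds unabated while the exchange is in flight.
for _ in range(1000):
    model = [m - 0.001 * m for m in model]    # stand-in gradient steps

handle.join()                                  # consensus round: wait only if needed
stale = inbox.get()
assert stale == [1.0, 2.0]                     # the stale snapshot arrived intact
assert model != stale                          # the local model moved on meanwhile
```

The `join()` at the consensus round is the only potential wait; if the delay is shorter than $\tau$ local steps, it returns immediately and communication is fully hidden.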
Latency Hiding:
No node idles waiting for communication, ensuring maximal hardware utilization. When consensus latency exceeds the computation time of a single gradient step, OLDSGD can increase $\tau$ to further amortize communication overhead.
Implications:
OLDSGD remains robust to heterogeneous node delays (“straggler” effects), as delayed models are simply those available at consensus time, and the protocol’s average update guarantees algorithmic correctness.
4. Empirical Evaluation and Performance
Experimental results from (Zhou et al., 4 Jan 2026) on benchmark tasks include vision (VGG-11, ResNet-18 on CIFAR-10) and language (GPT2-Small on WikiText-2), under both homogeneous and heterogeneous data splits:
- Hardware: 8× NVIDIA RTX 3090 GPUs (vision); undirected ring topology with varying agent counts (up to $n=16$).
- Communication delays modeled as equivalent to multiple gradient steps.
- Step size $\gamma = 0.01$ (vision); local batch size 8 for vision, 2 for language.
Wall-clock Convergence:
For a range of consensus delays, OLDSGD consistently achieves faster convergence in wall-clock time than LDSGD, KGT, LUGT, ring-allreduce LSGD, and Exact Diffusion (Zhou et al., 4 Jan 2026).
- Speedup vs LDSGD: 1.64× (geometric mean over tasks and splits)
- Speedup vs ring-allreduce LSGD: 1.98×
- Speedup vs LUGT: 3.05×
- With large consensus delays, speedup reaches 3.32× (GPT2-Small).
Scalability:
Scaling the agent count up to $n=16$, OLDSGD achieves near-linear wall-clock speedup (e.g., ~14× for VGG-11 at 16 nodes), though further scaling in ring topologies is constrained by network connectivity.
5. Implementation Considerations
Integrating OLDSGD into existing distributed learning frameworks requires minimal changes:
- Stale Model Exchange: Agents tag and transmit their model states every $\tau$ steps, using non-blocking sends/receives.
- CTA Consensus Update: At consensus time, agents buffer the past $\tau$ local gradients and execute

  $$x_i^{t+1} = \sum_{j=1}^{n} w_{ij}\, x_j^{t+1-\tau} - \gamma \sum_{s=t+1-\tau}^{t} g_i^{s}.$$

- Hyperparameters:
  - Stepsize $\gamma$: 0.01 (vision), 5e-5 to 1e-4 (language).
  - Local steps $\tau$: small for low-latency interconnects; larger (10–40) for high-latency networks.
  - Mixing weights $w_{ij}$: uniform on the ring (equal weight on self and left/right neighbors).
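The CTA consensus update can be sketched as a standalone per-agent function, assuming the stale neighbor states have already arrived by consensus time (function and argument names are ours):

```python
# Sketch of one agent's combine-then-adapt (CTA) step. Assumes the stale
# neighbor states from the previous consensus round are already in hand.
def cta_update(weights, stale_states, grad_buffer, gamma):
    """weights[j]: mixing weight w_ij; stale_states[j]: neighbor j's model
    from the last consensus round; grad_buffer: this agent's tau accumulated
    local gradients; gamma: stepsize. Returns the agent's new model."""
    d = len(grad_buffer[0])
    # Combine: doubly-stochastic mixing of stale neighbor states.
    combined = [sum(w * s[k] for w, s in zip(weights, stale_states)) for k in range(d)]
    # Adapt: subtract the locally accumulated gradients.
    accum = [sum(g[k] for g in grad_buffer) for k in range(d)]
    return [combined[k] - gamma * accum[k] for k in range(d)]

# Toy usage: 3 ring neighbors (incl. self), model dimension d = 2, tau = 2.
new_x = cta_update(
    weights=[1 / 3, 1 / 3, 1 / 3],
    stale_states=[[0.0, 3.0], [3.0, 0.0], [0.0, 0.0]],
    grad_buffer=[[1.0, 1.0], [1.0, 1.0]],
    gamma=0.1,
)
assert all(abs(v - 0.8) < 1e-12 for v in new_x)   # (0+3+0)/3 - 0.1 * 2 = 0.8
```

In a framework integration, `stale_states` would be filled by the non-blocking receives posted $\tau$ steps earlier, so this function is the only synchronization point.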
Editor's term: The combination of stale communication and “combine-then-adapt” update can be referenced as the OLDSGD CTA protocol.
6. Relationship to Prior and Contemporary Methods
OLDSGD generalizes the periodic averaging strategies found in Local SGD and Local Decentralized SGD, eliminating communication/idle bottlenecks typical of frequent synchronization. The protocol’s average-model equivalence to centralized SGD distinguishes it from schemes incurring staleness bias.
In (Wang et al., 2020), overlapping communication and computation is achieved via anchor models and non-blocking AllReduce, allowing mitigation of straggler effects and matching the convergence rate of fully synchronous SGD. Momentum variants further reduce oscillations in heterogeneous data scenarios.
A plausible implication is that OLDSGD’s general framework of exchanging stale states and performing locally accumulated gradient updates can be extended to alternative optimizers (Adam, RMSProp) or fully decentralized gossip-matrix mixing, substituting AllReduce consensus for scalable arbitrary topologies (Wang et al., 2020, Zhou et al., 4 Jan 2026).
7. Limitations and Extensions
The consensus error term in OLDSGD’s convergence bound grows as $\tau^2/(1-\lambda)^2$, so scaling benefits saturate as network topology connectivity degrades (e.g., on large rings, where the spectral gap $1-\lambda$ shrinks). Tuning $\tau$ provides a trade-off between overlap efficiency and consensus error. The theoretical justification requires doubly-stochastic mixing and bounded noise/heterogeneity. Extensions may include adapting the CTA update to asynchronous or time-varying graphs, as well as incorporating momentum (anchor or local) to further improve stability in non-IID or straggler-prone environments (Wang et al., 2020).
OLDSGD’s structure enables compatibility with common distributed frameworks (PyTorch DDP, MPI), requiring only stale-model communication and CTA update rule changes, and is thus a practical method for efficient decentralized learning under realistic networking constraints.