
OLDSGD: Overlapping Local Decentralized SGD

Updated 11 January 2026
  • OLDSGD is a distributed optimization method that overlaps communication and computation by exchanging stale model states and performing scheduled combine-then-adapt updates.
  • It achieves improved wall-clock efficiency and maintains convergence rates equivalent to centralized SGD in smooth, nonconvex settings.
  • Empirical results show near-linear scalability and significant training speedup compared to traditional decentralized and synchronous methods.

Overlapping Local Decentralized SGD (OLDSGD) is a distributed optimization method designed to maximize communication-computation overlap in decentralized stochastic gradient descent. OLDSGD achieves improved wall-clock efficiency and preserves the theoretical convergence rate of standard Local Decentralized SGD schemes, including in smooth nonconvex settings. It is applicable to a network of agents connected by an undirected graph, each holding a portion of the global objective; OLDSGD operates by exchanging stale model states and performing a “combine-then-adapt” update at scheduled consensus rounds. The method maintains network-wide average consistency with centralized SGD, ensuring no staleness-induced bias. Empirical results demonstrate substantial improvements in training speed relative to prior decentralized algorithms, with minimal implementation overhead (Wang et al., 2020, Zhou et al., 4 Jan 2026).

1. Algorithmic Framework

OLDSGD considers the problem

\min_{x\in\mathbb{R}^d}~ f(x) = \sum_{i=1}^n f_i(x)

where each agent $i$ holds a local objective $f_i(x)$ and communicates within an undirected topology represented by a symmetric, doubly-stochastic mixing matrix $W=[w_{ij}]$.

Update Mechanism:

  • Each agent performs $\tau$ local SGD steps between consensus rounds.
  • During non-consensus iterations ($t \bmod \tau \neq 0$), agent $i$ updates:

x_i^t = x_i^{t-1} - \alpha\,g_i^{t-1}

where $g_i^{t-1} = \nabla F(x_i^{t-1};\xi_i^{t-1})$ is the stochastic gradient.

  • At consensus rounds ($t \bmod \tau = 0$), each agent $i$ executes the combine-then-adapt update:

x_i^t = \sum_{j\in\mathcal{N}_i} w_{ij}\,x_j^{t-\tau} - \alpha \sum_{k=t-\tau}^{t-1} g_i^k

Agents exchange model states from iteration $t-\tau$, overlapping communication with computation.
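The schedule above can be sketched in a minimal numpy simulation. All constants here are illustrative, not from the paper: a 5-agent ring, noise-free gradients of synthetic quadratic objectives $f_i(x)=\tfrac12 x^\top A_i x$, and a small stepsize chosen for stability of the toy system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau, alpha, T = 5, 3, 5, 0.02, 100

# Uniform ring mixing matrix: weight 1/2 to each of the two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] += 0.5
    W[i, (i + 1) % n] += 0.5

# Local objectives f_i(x) = 0.5 x^T A_i x, so grad f_i(x) = A_i x.
A = np.stack([np.diag(rng.uniform(0.8, 1.2, d)) for _ in range(n)])

x = rng.standard_normal((n, d))   # current models x_i^t
x0 = x.copy()
x_stale = x.copy()                # snapshot x_i^{t-tau}, "in flight" between rounds
g_buf = np.zeros((n, d))          # gradients accumulated since the last consensus

for t in range(1, T + 1):
    g = np.einsum('ijk,ik->ij', A, x)   # g_i^{t-1}, evaluated at x_i^{t-1}
    g_buf += g
    if t % tau == 0:
        # Combine-then-adapt: mix the stale states, subtract accumulated gradients.
        x = W @ x_stale - alpha * g_buf
        x_stale = x.copy()              # snapshot exchanged during the next tau steps
        g_buf[:] = 0.0
    else:
        x = x - alpha * g               # plain local SGD step
```

Note that between consensus rounds only `x_stale` is communicated, so the mixing input at round $t$ is always $\tau$ steps old, exactly as in the update above.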

Average-Model Property:

The network-wide average model

\bar x^t = \frac{1}{n} \sum_{i=1}^n x_i^t

obeys the centralized SGD recursion

\bar x^{t} = \bar x^{t-1} - \frac{\alpha}{n} \sum_{i=1}^n g_i^{t-1}

for every $t$.
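This identity holds because $W$ is doubly stochastic: its columns sum to one, so averaging the combine-then-adapt update over agents collapses the mixing term to $\bar x^{t-\tau}$, and the accumulated gradients telescope the centralized recursion over $\tau$ steps. A quick numeric check with arbitrary synthetic states and gradients (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau, alpha = 5, 4, 3, 0.1

# Ring mixing matrix: weight 1/2 to each neighbor (doubly stochastic).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] += 0.5
    W[i, (i + 1) % n] += 0.5

x_stale = rng.standard_normal((n, d))   # stale states x_i^{t-tau}
g = rng.standard_normal((tau, n, d))    # gradients g_i^k, k = t-tau, ..., t-1

# CTA update for every agent, then average over agents.
x_new = W @ x_stale - alpha * g.sum(axis=0)
avg_cta = x_new.mean(axis=0)

# Centralized SGD recursion unrolled over the same tau steps from xbar^{t-tau}.
avg_central = x_stale.mean(axis=0) - (alpha / n) * g.sum(axis=(0, 1))

assert np.allclose(avg_cta, avg_central)  # the mixing term averages out exactly
```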

2. Theoretical Properties and Convergence

OLDSGD’s theoretical analysis in (Zhou et al., 4 Jan 2026) establishes convergence rates under the following assumptions:

  • The mixing matrix is doubly-stochastic; the graph is connected.
  • Each $f_i$ is $L$-smooth.
  • Stochastic gradients are unbiased with bounded variance and heterogeneity.

Main Result:

If the stepsize $\alpha$ is sufficiently small, the ergodic average gradient norm satisfies

\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar x^t)\|^2 \leq \frac{8(f(\bar x^0)-f^*)}{\alpha T} + \frac{8\alpha L}{n}\left(\frac{\sigma^2}{2} + M\zeta^2\right) + \frac{1024 L^2 D \tau}{np}\,\alpha^2

where $p = 1-\lambda_2^2$ (with $\lambda_2$ the second-largest eigenvalue of $W$) and $D = 6((2\tau+M)n\zeta^2 + n\sigma^2)/p$. Setting $\alpha=\sqrt{n/T}$ yields

\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar x^t)\|^2 = O\left(\frac{1}{\sqrt{nT}} + \frac{n}{T}\right)

thus matching the rate of standard Local Decentralized SGD for smooth nonconvex objectives (Zhou et al., 4 Jan 2026).
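Substituting $\alpha=\sqrt{n/T}$ into the bound term by term makes this explicit (a quick sanity check using the constants defined above; note $D$ scales linearly in $n$ for fixed $\tau$ and $p$):

```latex
\frac{8(f(\bar x^0)-f^*)}{\alpha T} = \frac{8(f(\bar x^0)-f^*)}{\sqrt{nT}},
\qquad
\frac{8\alpha L}{n}\Big(\frac{\sigma^2}{2}+M\zeta^2\Big)
  = \frac{8L}{\sqrt{nT}}\Big(\frac{\sigma^2}{2}+M\zeta^2\Big),
\qquad
\frac{1024 L^2 D \tau}{np}\,\alpha^2
  = \frac{1024 L^2 D \tau}{pT}
  = O\!\left(\frac{n}{T}\right)
```

The first two terms are $O(1/\sqrt{nT})$ and the third is $O(n/T)$, giving the stated rate.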

This suggests that OLDSGD provides theoretical guarantees equivalent to non-overlapping decentralized variants, but with improved runtime characteristics due to reduced synchronization waiting.

3. Communication-Overlap and Practical Efficiency

The primary innovation of OLDSGD is the strict separation of consensus communication and local computation:

  • Stale model states $x_i^{t-\tau}$ are sent as soon as they are available, using non-blocking primitives (MPI_Isend/Irecv or torch.distributed.isend/irecv).
  • Local computation proceeds unabated during communication, with model states arriving by the next consensus round.
  • Consensus steps “combine” the received states and “adapt” with the accumulated local gradients (CTA).

Latency Hiding:

No node idles waiting for communication, ensuring maximal hardware utilization. When the consensus latency $c$ exceeds the computation time per gradient step, OLDSGD can increase $\tau$ to further amortize the communication overhead.
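As an illustration only (the implementations above use MPI or torch.distributed non-blocking primitives), the overlap pattern can be mimicked with a background thread that delivers the stale snapshot while the main thread keeps taking local steps. All names, the toy objective, and the latency value here are hypothetical:

```python
import threading
import time

import numpy as np

def send_stale_state(snapshot, mailbox, latency_s=0.05):
    """Stand-in for a non-blocking send: deliver after a simulated network delay."""
    time.sleep(latency_s)
    mailbox.append(snapshot)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
tau, alpha = 5, 0.1
mailbox = []  # where a peer would find our stale model

# Kick off the transfer of x^{t-tau} without blocking...
sender = threading.Thread(target=send_stale_state, args=(x.copy(), mailbox))
sender.start()

# ...and keep computing local SGD steps while it is in flight.
for _ in range(tau):
    g = x  # gradient of the toy objective 0.5 * ||x||^2
    x = x - alpha * g

sender.join()  # by the consensus round, the transfer has completed
```

The snapshot in `mailbox` is the state from $\tau$ steps ago, while the local model has advanced, which is exactly the staleness the CTA update tolerates.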

Implications:

OLDSGD remains robust to heterogeneous node delays (“straggler” effects), as delayed models are simply those available at consensus time, and the protocol’s average update guarantees algorithmic correctness.

4. Empirical Evaluation and Performance

Experiments in (Zhou et al., 4 Jan 2026) cover vision tasks (VGG-11 and ResNet-18 on CIFAR-10) and a language task (GPT2-Small on WikiText-2), under both homogeneous and heterogeneous data splits:

  • Hardware: 8× NVIDIA RTX 3090 (vision); agent counts $n\in\{8,9\}$; undirected ring topology.
  • Communication delays modeled as equivalent to multiple gradient steps ($c$).
  • Stepsize $\alpha=0.01$ (vision); local batch size 8 for vision, 2 for language.

Wall-clock Convergence:

For various consensus delays ($c = 1$ and $c = 5$), OLDSGD consistently achieves faster convergence in real time than LDSGD, KGT, LUGT, ring-allreduce LSGD, and Exact Diffusion (Zhou et al., 4 Jan 2026).

  • Speedup vs LDSGD: 1.64× (geometric mean over tasks and splits)
  • Speedup vs ring-allreduce LSGD: 1.98×
  • Speedup vs LUGT: 3.05×
  • With large $c$, the speedup reaches 3.32× (GPT2).

Scalability:

Scaling from $n=2$ to $n=16$, OLDSGD achieves near-linear wall-clock speedup (e.g., ~14× for VGG-11 at 16 nodes), though further scaling in ring topologies is constrained by network connectivity.

5. Implementation Considerations

Integrating OLDSGD into existing distributed learning frameworks requires minimal changes:

  • Stale Model Exchange: Agents tag and transmit $x_i^{t-\tau}$ every $\tau$ steps, using non-blocking sends/receives.
  • CTA Consensus Update: At consensus time, agents buffer the past $\tau$ local gradients and execute

x_i^t = \sum_{j} w_{ij}\,x_j^{t-\tau} - \alpha\sum_{k=t-\tau}^{t-1} g_i^k

  • Hyperparameters:
    • Stepsize $\alpha$: 0.01 (vision); 5e-5 to 1e-4 (language)
    • Local steps $\tau$: small ($\leq 5$) for low-latency interconnects; larger (10–40) for high-latency networks.
    • Mixing weights $w_{ij}$: uniform on a ring ($\tfrac12$ to each of the left and right neighbors).
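The uniform ring weights give a simple symmetric, doubly-stochastic $W$. A short sketch constructing it and computing the spectral quantity $p = 1 - \lambda_2^2$ from Section 2 (the helper name is ours; $n = 8$ matches the vision setup, and $\lambda_2$ is taken as the second-largest eigenvalue, per the definition above):

```python
import numpy as np

def ring_mixing(n):
    """Uniform ring: weight 1/2 to each of the two neighbors (doubly stochastic)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] += 0.5
        W[i, (i + 1) % n] += 0.5
    return W

W = ring_mixing(8)
assert np.allclose(W, W.T)               # symmetric
assert np.allclose(W.sum(axis=1), 1.0)   # rows sum to 1 (columns too, by symmetry)

eigs = np.sort(np.linalg.eigvalsh(W))    # ascending
lam2 = eigs[-2]                          # second-largest eigenvalue of W
p = 1 - lam2 ** 2                        # spectral quantity entering the bound
```

Since the consensus error term scales as $1/p$, denser topologies (larger $p$) tolerate larger $\tau$ for the same error budget.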

Editor's term: The combination of stale communication and the “combine-then-adapt” update can be referred to as the OLDSGD CTA protocol.

6. Relationship to Prior and Contemporary Methods

OLDSGD generalizes the periodic averaging strategies found in Local SGD and Local Decentralized SGD, eliminating communication/idle bottlenecks typical of frequent synchronization. The protocol’s average-model equivalence to centralized SGD distinguishes it from schemes incurring staleness bias.

In (Wang et al., 2020), overlapping communication and computation is achieved via anchor models and non-blocking AllReduce, allowing mitigation of straggler effects and matching the convergence rate of fully synchronous SGD. Momentum variants further reduce oscillations in heterogeneous data scenarios.

A plausible implication is that OLDSGD’s general framework of exchanging stale states and performing locally accumulated gradient updates can be extended to alternative optimizers (Adam, RMSProp) or fully decentralized gossip-matrix mixing, substituting AllReduce consensus for scalable arbitrary topologies (Wang et al., 2020, Zhou et al., 4 Jan 2026).

7. Limitations and Extensions

The consensus error term in OLDSGD’s convergence is proportional to $\alpha^2\tau/p$, with scaling benefits saturating as network topology connectivity degrades (e.g., large rings). Tuning $\tau$ provides a trade-off between overlap efficiency and consensus error. The theoretical justification requires doubly-stochastic mixing and bounded noise/heterogeneity. Extensions may include adapting the CTA update to asynchronous or time-varying graphs, as well as incorporating momentum (anchor or local) to further improve stability in non-IID or straggler-prone environments (Wang et al., 2020).

OLDSGD’s structure enables compatibility with common distributed frameworks (PyTorch DDP, MPI), requiring only stale-model communication and CTA update rule changes, and is thus a practical method for efficient decentralized learning under realistic networking constraints.

References (2)
