
EF21-SGDM: Error Feedback SGD with Momentum

Updated 29 January 2026
  • The paper introduces EF21-SGDM, a distributed optimization algorithm that combines error feedback, stochastic gradients, and Polyak momentum to achieve optimal communication and sample complexity in nonconvex settings.
  • The methodology employs error feedback memory, momentum buffers, and contractive compressors to overcome limitations of prior approaches without relying on large batch sizes or bounded gradients.
  • Theoretical and empirical analyses demonstrate that EF21-SGDM attains O(1/T) convergence rates and relaxes restrictive assumptions, paving the way for more efficient distributed deep learning.

EF21-SGDM is a distributed optimization algorithm that combines the EF21 error-feedback mechanism, stochastic gradient descent (SGD), and Polyak momentum, yielding theoretical and practical improvements in communication- and sample-efficient training of machine learning models under contractive-compressor regimes. The method relaxes strong assumptions prevalent in previous error-feedback approaches and achieves optimal communication and sample complexity in nonconvex settings, notably without requiring large batch sizes or bounded gradient dissimilarity conditions (Fatkhullin et al., 2021).

1. Error Feedback Foundations and EF21

Error feedback has emerged as a key approach enabling the convergence of distributed gradient methods under lossy (contractive) communication schemes. The EF21 mechanism, introduced by Richtárik et al. (2021), leverages a Markov compressor induced by a contractive compressor $\mathcal{C}$ with parameter $\alpha \in (0,1]$, satisfying

$$\mathbb{E}\big[\|\mathcal{C}(u)-u\|^2\big] \leq (1-\alpha)\,\|u\|^2.$$

EF21 mitigates limitations of previous heuristics (e.g., EF14; Seide et al., 2014), such as pessimistic $O(1/T^{2/3})$ rates and reliance on bounded-gradient conditions, by delivering $O(1/T)$ convergence in the smooth nonconvex regime and supporting strong theoretical guarantees (Fatkhullin et al., 2021).
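For concreteness, the contraction inequality can be checked numerically for a Top-$K$ compressor, which satisfies it deterministically with $\alpha = K/d$ (a minimal NumPy sketch; function names are illustrative, not from the paper):

```python
import numpy as np

def top_k(u, k):
    """Top-K compressor: keep the k largest-magnitude entries of u."""
    out = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    out[idx] = u[idx]
    return out

# Top-K is contractive with alpha = k/d, even without the expectation:
#   ||top_k(u) - u||^2 <= (1 - k/d) * ||u||^2   for every u,
# since the dropped entries are the d-k smallest in magnitude.
rng = np.random.default_rng(0)
d, k = 100, 10
u = rng.standard_normal(d)
err = np.linalg.norm(top_k(u, k) - u) ** 2
bound = (1 - k / d) * np.linalg.norm(u) ** 2
print(err <= bound)  # True
```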

2. Algorithmic Structure of EF21-SGDM

EF21-SGDM adapts the baseline EF21 error-feedback protocol by incorporating stochastic gradients and momentum. Each of $n$ distributed workers maintains local states $\{g_i^t, v_i^t\}$, where $g_i^t$ is the error-feedback memory and $v_i^t$ is the momentum buffer. At each iteration $t$:

  • The master broadcasts the global parameter $x^t$.
  • Each worker computes a stochastic gradient estimate $\nabla f_i(x^t, \xi_i^t)$ (with bounded variance $\sigma^2$) and updates its local momentum buffer:

$$v_i^{t+1} = (1 - \eta)\,v_i^t + \eta\,\nabla f_i(x^t, \xi_i^t).$$

  • The compression error is calculated:

$$c_i^{t+1} = \mathcal{C}(v_i^{t+1} - g_i^t), \qquad g_i^{t+1} = g_i^t + c_i^{t+1}.$$

  • The master aggregates $g^{t+1} = \frac{1}{n}\sum_{i=1}^n g_i^{t+1}$.
  • The global iterate is updated with stepsize $\gamma$:

$$x^{t+1} = x^t - \gamma\, g^{t+1}.$$

The momentum parameter $\eta$ is typically selected adaptively. In practice, Top-$K$ compressors give $\alpha \approx K/d$ (Fatkhullin et al., 2021).
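To make the update order concrete, the per-iteration steps above can be sketched as a single-process NumPy simulation (function names and the toy problem are illustrative, not from the paper):

```python
import numpy as np

def top_k(u, k=2):
    """Top-K contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    out[idx] = u[idx]
    return out

def ef21_sgdm(grad_fns, x0, compressor, gamma, eta, T, seed=0):
    """Illustrative EF21-SGDM loop; grad_fns[i](x, rng) returns a
    stochastic gradient of f_i at x."""
    rng = np.random.default_rng(seed)
    n, x = len(grad_fns), x0.copy()
    v = [np.zeros_like(x0) for _ in range(n)]  # momentum buffers v_i
    g = [np.zeros_like(x0) for _ in range(n)]  # error-feedback memories g_i
    for _ in range(T):
        for i in range(n):
            v[i] = (1 - eta) * v[i] + eta * grad_fns[i](x, rng)  # momentum
            g[i] = g[i] + compressor(v[i] - g[i])  # compress residual, update memory
        x = x - gamma * np.mean(g, axis=0)         # master step on aggregate
    return x

# toy check: f_i(x) = 0.5 * ||x - a_i||^2, so the minimizer is mean(a_i)
rng0 = np.random.default_rng(1)
targets = [rng0.standard_normal(5) for _ in range(4)]
grads = [lambda x, rng, a=a: (x - a) + 0.01 * rng.standard_normal(x.size)
         for a in targets]
x_hat = ef21_sgdm(grads, np.zeros(5), top_k, gamma=0.1, eta=0.5, T=2000)
print(np.linalg.norm(x_hat - np.mean(targets, axis=0)))  # close to zero
```

Even though only 2 of 5 coordinates are communicated per round, the error-feedback memories $g_i$ eventually track the momentum buffers, so the iterate converges to the minimizer.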

3. Convergence Analysis and Complexity Bounds

Under standard smoothness ($L_i$-Lipschitz gradients) and bounded-variance assumptions, EF21-SGDM achieves the following convergence guarantee in the nonconvex regime:

$$\mathbb{E}\|\nabla f(\hat x)\|^2 \leq \frac{L\,\delta_0}{\alpha T} + \frac{(L\,\delta_0\,\sigma^2)^{1/2}}{T^{1/2}} + \frac{\sigma^2}{n\,T^{1/2}},$$

where $\delta_0 = f(x^0)-f^*$. To reach $\mathbb{E}\|\nabla f(\hat x)\|^2 \leq \varepsilon^2$, the required number of iterations is

$$T = O\big(L\alpha^{-1}\varepsilon^{-2} + L\sigma^2 n^{-1}\varepsilon^{-4}\big).$$

This rate matches the optimal $O(1/T)$ communication complexity of gradient descent but, crucially, does not require large mini-batch sizes ($B=1$ suffices). In deep learning experiments, heavy-ball momentum ($\beta \approx 0.9$–$0.99$) improves empirical generalization without affecting the asymptotic rate (Fatkhullin et al., 2021).
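As a rough illustration of how the bound trades off compression and noise, one can plug representative values into the iteration count (constants and the $\delta_0$ factor suppressed; all numbers below are hypothetical):

```python
# T = O(L / (alpha * eps^2) + L * sigma^2 / (n * eps^4)), constants dropped
L, sigma2, n, eps = 1.0, 1.0, 100, 0.1   # hypothetical problem constants
d, k = 10**6, 10**4                      # Top-K compressor => alpha ~ k/d
alpha = k / d
T_comm = L / (alpha * eps**2)            # compression-limited term
T_stat = L * sigma2 / (n * eps**4)       # noise-limited term
print(T_comm, T_stat)                    # here the compression term dominates
```

At this (modest) target accuracy the $\alpha^{-1}$ term dominates; as $\varepsilon \to 0$ the statistical $\varepsilon^{-4}$ term eventually takes over.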

4. Double-Momentum Variant: EF21-SGD2M

EF21-SGD2M generalizes EF21-SGDM by introducing an additional momentum buffer; specifically, each worker maintains $(v_i^t, u_i^t)$, updated as

$$v_i^{t+1} = (1-\eta)\,v_i^t + \eta\,\nabla f_i(x^t, \xi_i^t), \qquad u_i^{t+1} = (1-\eta)\,u_i^t + \eta\, v_i^{t+1}.$$

Compression and error feedback proceed from $u_i^{t+1}$. The resulting convergence rate retains the $O(\alpha^{-1}\varepsilon^{-2})$ communication cost and eliminates the suboptimal $\varepsilon^{-3}$ term appearing in the sample-complexity bounds of previous error-feedback methods (Fatkhullin et al., 2021).
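The only change relative to EF21-SGDM is this extra averaging stage; a minimal sketch of the two-buffer update (names are illustrative):

```python
import numpy as np

def sgd2m_buffers(v, u, grad, eta):
    """One EF21-SGD2M buffer update: stochastic gradient -> v -> u."""
    v = (1 - eta) * v + eta * grad   # first momentum buffer
    u = (1 - eta) * u + eta * v      # second momentum buffer
    return v, u

# compression then acts on u instead of v: c_i = C(u_i - g_i), g_i += c_i
v, u = sgd2m_buffers(np.zeros(3), np.zeros(3), np.ones(3), eta=0.5)
print(v[0], u[0])  # 0.5 0.25 -- u reacts more smoothly than v
```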

5. Practical Guidelines and Parameter Selection

Table: Recommended EF21-SGDM parameter regimes.

| Parameter | Typical value | Remarks |
|---|---|---|
| Compressor parameter $\alpha$ | $K/d$ (Top-$K$) | Governs compression rate |
| Stepsize $\gamma$ | $\approx 1/(2L)$ | May be enlarged within $(0,2)$ |
| Momentum $\beta$ | $0.9$–$0.99$ | Lower in noisy settings |
| Batch size $B$ | $1$ | Large batches not needed |

EF21-SGDM is robust to small batch sizes and noisy communication, and does not rely on extra bounded-gradient or similarity assumptions.

6. Quantitative Comparison to Prior Art

Table: Quantitative comparison of error feedback variants under identical smoothness, variance, and Top-K compression (Fatkhullin et al., 2021).

| Method | Comm. (Top-$K$) | Sample (per node) | Batch-free | Extra assumptions |
|---|---|---|---|---|
| EF14-SGD | $O(K\alpha^{-1}\varepsilon^{-3}+K\sigma^2 n^{-1}\varepsilon^{-4})$ | $O(\alpha^{-1}\varepsilon^{-3}+\sigma^2 n^{-1}\varepsilon^{-4})$ | Yes | Bounded gradients |
| EF21-SGD | $O(K\alpha^{-1}\varepsilon^{-2})$ (large $B$) | $O(K\sigma^2\alpha^{-2}n^{-1}\varepsilon^{-4})$ | No | No BG/BGS |
| BEER | $O(K\alpha^{-1}\varepsilon^{-2})$ (large $B$) | $O(K\sigma^2\alpha^{-1}n^{-1}\varepsilon^{-4})$ | No | No BG/BGS |
| EF21-SGDM | $O(K\varepsilon^{-2})$ | $O(K\sigma^2 n^{-1}\varepsilon^{-4})$ | Yes | None |
| EF21-SGD2M | $O(K\varepsilon^{-2})$ | $O(K\sigma^2 n^{-1}\varepsilon^{-4})$ | Yes | None |

EF21-SGDM achieves optimal communication and sample complexities absent the restrictive assumptions and large batch requirements of preceding algorithms.

7. Theoretical Insights and Proof Techniques

The convergence analysis of EF21-SGDM employs a Lyapunov function combining objective gap, momentum buffer variance, and error feedback memory, enabling precise tracking of iterates and error propagation:

$$\Psi_t := f(x^t)-f^* + \frac{\gamma}{\eta}\,\|v^t - \nabla f(x^t)\|^2 + \frac{\gamma\eta}{\alpha^2 n} \sum_{i=1}^n \|v_i^t - \nabla f_i(x^t)\|^2.$$

Key technical recurrences establish control over momentum-variance and compression error, facilitating optimal $O(1/T)$ convergence rates in the distributed nonconvex setting (Fatkhullin et al., 2021).
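On small problems where exact per-worker gradients $\nabla f_i$ can be evaluated, $\Psi_t$ can be computed directly as a training diagnostic (a sketch under that assumption; the function and argument names are illustrative):

```python
import numpy as np

def lyapunov(x, v_list, f, f_star, grad_fns, gamma, eta, alpha):
    """Evaluate Psi_t = f(x) - f* + (gamma/eta) * ||v - grad f(x)||^2
       + (gamma*eta/(alpha^2 * n)) * sum_i ||v_i - grad f_i(x)||^2,
    where v is the average of the per-worker momentum buffers v_i."""
    n = len(v_list)
    v_bar = np.mean(v_list, axis=0)
    grad_f = np.mean([g(x) for g in grad_fns], axis=0)
    psi = f(x) - f_star
    psi += (gamma / eta) * np.linalg.norm(v_bar - grad_f) ** 2
    psi += (gamma * eta / (alpha**2 * n)) * sum(
        np.linalg.norm(v - g(x)) ** 2 for v, g in zip(v_list, grad_fns))
    return psi

# sanity check on f(x) = 0.5||x||^2 with v_1 = grad f(x): only the gap remains
x = np.ones(2)
val = lyapunov(x, [x.copy()], lambda z: 0.5 * z @ z, 0.0,
               [lambda z: z], gamma=0.1, eta=0.5, alpha=0.5)
print(val)  # 1.0
```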

A plausible implication is that these techniques are of independent interest in nonconvex stochastic optimization with momentum, as the analysis remains robust even without incorporating compression. This positions EF21-SGDM and EF21-SGD2M as rigorous foundations for future investigations into communication-efficient distributed learning.
