EF21-SGDM: Error Feedback SGD with Momentum
- The paper introduces EF21-SGDM, a distributed optimization algorithm that combines error feedback, stochastic gradients, and Polyak momentum to achieve optimal communication and sample complexity in nonconvex settings.
- The methodology employs error feedback memory, momentum buffers, and contractive compressors to overcome limitations of prior approaches without relying on large batch sizes or bounded gradients.
- Theoretical and empirical analyses demonstrate that EF21-SGDM attains optimal convergence rates while relaxing restrictive assumptions, paving the way for more efficient distributed deep learning.
EF21-SGDM is a distributed optimization algorithm that combines advanced error feedback (EF21), stochastic gradient descent (SGD), and Polyak momentum, introducing theoretical and practical improvements in communication- and sample-efficient training of machine learning models under contractive-compressor regimes. The method relaxes strong assumptions prevalent in previous error-feedback approaches and achieves optimal communication and sample complexity in nonconvex settings, notably without requiring large batch sizes or bounded gradient dissimilarity conditions (Fatkhullin et al., 2021).
1. Error Feedback Foundations and EF21
Error feedback has emerged as a key approach enabling the convergence of distributed gradient methods under lossy (contractive) communication schemes. The EF21 mechanism, introduced by Richtárik et al. (2021), leverages a Markov compressor induced by a contractive compressor $\mathcal{C}$ with parameter $\alpha \in (0, 1]$, satisfying
$$\mathbb{E}\bigl\|\mathcal{C}(x) - x\bigr\|^2 \le (1-\alpha)\,\|x\|^2 \quad \text{for all } x \in \mathbb{R}^d.$$
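As a concrete instance, a Top-$K$ compressor satisfies the contractive property with $\alpha = K/d$, and does so deterministically. The sketch below (an illustrative NumPy implementation, not code from the paper) builds Top-$K$ and verifies $\|\mathcal{C}(x) - x\|^2 \le (1-\alpha)\|x\|^2$ on random vectors:

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

# Verify the contractive property with alpha = k/d. Top-K is deterministic,
# so the expectation in the general definition is not needed here.
rng = np.random.default_rng(0)
d, k = 100, 10
alpha = k / d
for _ in range(100):
    x = rng.standard_normal(d)
    err = np.sum((top_k(x, k) - x) ** 2)
    assert err <= (1 - alpha) * np.sum(x ** 2) + 1e-12
```

The property holds because discarding the $d - K$ smallest-magnitude coordinates removes at most a $(d-K)/d$ fraction of the vector's energy.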
EF21 mitigates limitations of earlier heuristics such as EF14 (Seide et al., 2014), including pessimistic rates and reliance on bounded-gradient conditions, by delivering convergence in the smooth nonconvex regime with strong theoretical guarantees (Fatkhullin et al., 2021).
2. Algorithmic Structure of EF21-SGDM
EF21-SGDM augments the baseline EF21 error-feedback protocol with stochastic gradients and momentum. Each of the $n$ distributed workers maintains local states $(g_t^i, v_t^i)$, where $g_t^i$ is the error-feedback memory and $v_t^i$ is the momentum buffer. At each iteration $t$:
- The master broadcasts the global parameter $x_t$.
- Each worker computes a stochastic gradient estimate $\hat{\nabla} f_i(x_t)$ (with bounded variance $\sigma^2$) and updates its local momentum buffer: $v_t^i = (1-\eta)\,v_{t-1}^i + \eta\,\hat{\nabla} f_i(x_t)$.
- The error-feedback memory is updated with the compressed difference, which is all that is communicated: $g_{t+1}^i = g_t^i + \mathcal{C}\!\left(v_t^i - g_t^i\right)$.
- The master aggregates $g_{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_{t+1}^i$.
- The global iterate is updated with stepsize $\gamma$: $x_{t+1} = x_t - \gamma\,g_{t+1}$.
The momentum parameter $\eta$ is typically selected adaptively. In practice, Top-$K$ compressors, with $\alpha = K/d$, govern the communication budget (Fatkhullin et al., 2021).
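Putting the steps above together, here is a minimal single-process simulation of EF21-SGDM (a sketch, not the paper's implementation): it assumes quadratic local objectives $f_i(x) = \tfrac12\|x - b_i\|^2$, Gaussian gradient noise, and the momentum/memory update forms described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 4, 20, 4                  # workers, dimension, Top-K budget (alpha = k/d)
gamma, eta, sigma = 0.1, 0.1, 0.01  # stepsize, momentum weight, gradient-noise level
b = rng.standard_normal((n, d))     # local objectives f_i(x) = 0.5 * ||x - b_i||^2
x_star = b.mean(axis=0)             # minimizer of f(x) = (1/n) * sum_i f_i(x)

def top_k(z, k):
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

x = np.zeros(d)
v = np.zeros((n, d))  # momentum buffers v_t^i
g = np.zeros((n, d))  # error-feedback memories g_t^i

for t in range(2000):
    for i in range(n):
        grad = (x - b[i]) + sigma * rng.standard_normal(d)  # stochastic gradient
        v[i] = (1 - eta) * v[i] + eta * grad                # momentum buffer update
        g[i] += top_k(v[i] - g[i], k)                       # EF21 memory update (compressed message)
    x = x - gamma * g.mean(axis=0)                          # master aggregates and steps

print(np.linalg.norm(x - x_star))  # small residual at the noise floor
```

Despite compressing each message to $k = d/5$ coordinates and using batch size 1, the iterate approaches the minimizer, mirroring the batch-free behavior described above.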
3. Convergence Analysis and Complexity Bounds
Under standard smoothness ($L$-Lipschitz gradients) and bounded-variance assumptions, EF21-SGDM achieves the following convergence guarantee in the nonconvex regime:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl\|\nabla f(x_t)\bigr\| \le \mathcal{O}\!\left(\sqrt{\frac{L\,\delta^0}{\alpha T}} + \left(\frac{L\,\delta^0\,\sigma^2}{n\,T}\right)^{1/4}\right),$$
where $\delta^0 = f(x^0) - \inf_x f(x)$. To reach $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\| \le \varepsilon$, the required number of iterations is
$$T = \mathcal{O}\!\left(\frac{L\,\delta^0}{\alpha\,\varepsilon^2} + \frac{L\,\delta^0\,\sigma^2}{n\,\varepsilon^4}\right).$$
This rate matches the optimal communication complexity of gradient descent but, crucially, does not require large mini-batch sizes ($B = 1$ suffices). In deep learning experiments, heavy-ball momentum improves empirical generalization without affecting the asymptotic rate (Fatkhullin et al., 2021).
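For intuition about how the deterministic and stochastic terms trade off, the snippet below evaluates a bound of the assumed form $T \approx L\delta^0/(\alpha\varepsilon^2) + L\delta^0\sigma^2/(n\varepsilon^4)$, with hypothetical constants chosen only for illustration (not values from the paper):

```python
# Hypothetical problem constants, chosen only for illustration.
L_smooth, delta0, sigma2, n, alpha = 1.0, 1.0, 1.0, 8, 0.05

def iterations(eps):
    """Assumed iteration bound, split into its two terms."""
    det = L_smooth * delta0 / (alpha * eps ** 2)         # communication-driven term
    stoch = L_smooth * delta0 * sigma2 / (n * eps ** 4)  # variance-driven term
    return det, stoch

for eps in (1e-1, 1e-2):
    det, stoch = iterations(eps)
    print(eps, det, stoch)
```

With these constants, the $1/(\alpha\varepsilon^2)$ term dominates at loose targets, so compression quality ($\alpha$) matters most; at tight targets the $\sigma^2/(n\varepsilon^4)$ term takes over, and adding workers ($n$) gives linear speedup.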
4. Double-Momentum Variant: EF21-SGD2M
EF21-SGD2M generalizes EF21-SGDM by introducing an additional momentum buffer: alongside $v_t^i$, each worker maintains a second sequence obtained by applying a further momentum (averaging) step to $v_t^i$. Compression and error feedback then proceed from this second buffer rather than from $v_t^i$ directly. The resulting method retains the $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ communication cost and eliminates a suboptimal term appearing in previous error-feedback methods' sample complexity bounds (Fatkhullin et al., 2021).
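One plausible worker-side reading of the double-momentum step is sketched below; the exact placement of the second momentum is an assumption here, and the names `u`, `eta1`, `eta2` are illustrative, not from the paper:

```python
import numpy as np

def worker_update(grad, u, v, g, eta1=0.1, eta2=0.1, compress=lambda z: z):
    """One EF21-SGD2M-style worker step (sketch; exact form is an assumption).

    u: first momentum buffer (as in EF21-SGDM),
    v: second momentum buffer, further smoothing u,
    g: EF21 error-feedback memory; compression acts on v, not on u.
    """
    u = (1 - eta1) * u + eta1 * grad  # first momentum step
    v = (1 - eta2) * v + eta2 * u     # second momentum step
    g = g + compress(v - g)           # EF21 memory update on the second buffer
    return u, v, g
```

With a constant gradient and an identity compressor, all three buffers converge geometrically to that gradient, showing that the extra averaging reshapes the noise profile rather than moving the fixed point.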
5. Practical Guidelines and Parameter Selection
Table: Recommended EF21-SGDM parameter regimes.
| Parameter | Typical Value | Remarks |
|---|---|---|
| Compressor $\mathcal{C}$ | Top-$K$, $\alpha = K/d$ | Governs compression rate |
| Stepsize $\gamma$ | $\mathcal{O}(\alpha/L)$ (theory) | May be enlarged in practice |
| Momentum $1-\eta$ | $0.9$–$0.99$ | Lower in noisy settings |
| Batch size $B$ | $1$ | Large batches not needed |
EF21-SGDM is robust to small batch sizes and noisy communication, and does not rely on extra bounded-gradient or similarity assumptions.
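The table's recommendations can be bundled into a small helper. This is a heuristic sketch: the function name is invented, and the $\gamma \approx \alpha/L$ starting point is an assumption in line with EF21-style stepsize theory.

```python
def ef21_sgdm_defaults(d: int, k: int, L: float, momentum: float = 0.9) -> dict:
    """Heuristic hyperparameter suggestions for EF21-SGDM (sketch).

    d: model dimension, k: Top-K budget, L: smoothness constant.
    """
    alpha = k / d                # contraction parameter of Top-K
    return {
        "alpha": alpha,
        "stepsize": alpha / L,   # theory-driven starting point; often enlarged in practice
        "momentum": momentum,    # use a lower value (e.g. 0.8) in very noisy settings
        "batch_size": 1,         # large batches are not needed
    }

cfg = ef21_sgdm_defaults(d=1_000_000, k=10_000, L=1.0)
print(cfg)
```

For a million-parameter model compressed to ten thousand coordinates, this yields $\alpha = 0.01$ and a conservative stepsize of $0.01/L$ as a tuning anchor.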
6. Quantitative Comparison to Prior Art
Table: Quantitative comparison of error feedback variants under identical smoothness, variance, and Top-K compression (Fatkhullin et al., 2021).
| Method | Comm. (Top-$K$) | Sample (per node) | Batch-free | Extra assumptions |
|---|---|---|---|---|
| EF14-SGD | pessimistic in $\alpha$ | suboptimal | Yes | Needs bounded gradients |
| EF21-SGD | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | requires large batch $B$ | No | No BG/BGS |
| BEER | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | requires large batch $B$ | No | No BG/BGS |
| EF21-SGDM | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon^4}\right)$ | Yes | None |
| EF21-SGD2M | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon^4}\right)$ | Yes | None |
EF21-SGDM achieves optimal communication and sample complexities absent the restrictive assumptions and large batch requirements of preceding algorithms.
7. Theoretical Insights and Proof Techniques
The convergence analysis of EF21-SGDM employs a Lyapunov function combining the objective gap, momentum-buffer variance, and error-feedback memory, enabling precise tracking of iterates and error propagation:
$$\Phi_t = f(x_t) - f^{\star} + \frac{c_1}{n}\sum_{i=1}^{n}\bigl\|g_t^i - v_t^i\bigr\|^2 + \frac{c_2}{n}\sum_{i=1}^{n}\bigl\|v_t^i - \nabla f_i(x_t)\bigr\|^2,$$
for suitable constants $c_1, c_2 > 0$ depending on $\gamma$, $\eta$, and $\alpha$.
Key technical recurrences establish control over momentum-variance and compression error, facilitating optimal convergence rates in the distributed nonconvex setting (Fatkhullin et al., 2021).
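The Lyapunov-based argument can be checked numerically on a toy quadratic setting: below, $\Phi_t$ combines the objective gap with the two buffer-mismatch terms (the choice $c_1 = c_2 = 1$ and the exact functional are illustrative assumptions), and it decays along a deterministic EF21-SGDM run.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 4, 10, 2      # workers, dimension, Top-K budget
gamma, eta = 0.05, 0.2  # stepsize and momentum weight
b = rng.standard_normal((n, d))  # f_i(x) = 0.5 * ||x - b_i||^2
x_star = b.mean(axis=0)
f_star = 0.5 * np.mean(np.sum((x_star - b) ** 2, axis=1))

def top_k(z, k):
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def lyapunov(x, v, g, c1=1.0, c2=1.0):
    """Objective gap + error-feedback mismatch + momentum mismatch."""
    f = 0.5 * np.mean(np.sum((x - b) ** 2, axis=1))
    grads = x - b  # exact per-node gradients
    return (f - f_star
            + c1 * np.mean(np.sum((g - v) ** 2, axis=1))
            + c2 * np.mean(np.sum((v - grads) ** 2, axis=1)))

x, v, g = np.zeros(d), np.zeros((n, d)), np.zeros((n, d))
phis = []
for t in range(2000):
    phis.append(lyapunov(x, v, g))
    grads = x - b                   # deterministic run (sigma = 0)
    v = (1 - eta) * v + eta * grads
    for i in range(n):
        g[i] += top_k(v[i] - g[i], k)
    x = x - gamma * g.mean(axis=0)

print(phis[0], phis[-1])  # Phi shrinks along the run
```

All three Lyapunov terms vanish together, which is the numerical counterpart of the recurrences controlling momentum variance and compression error.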
A plausible implication is that these techniques are of independent interest in nonconvex stochastic optimization with momentum, as the analysis remains robust even without incorporating compression. This positions EF21-SGDM and EF21-SGD2M as rigorous foundations for future investigations into communication-efficient distributed learning.