EF21-SGDM: Error Feedback SGD with Momentum
- The paper introduces EF21-SGDM, a distributed optimization algorithm that combines error feedback, stochastic gradients, and Polyak momentum to achieve optimal communication and sample complexity in nonconvex settings.
- The methodology employs error feedback memory, momentum buffers, and contractive compressors to overcome limitations of prior approaches without relying on large batch sizes or bounded gradients.
- Theoretical and empirical analyses demonstrate that EF21-SGDM attains optimal convergence rates while relaxing restrictive assumptions, paving the way for more efficient distributed deep learning.
EF21-SGDM is a distributed optimization algorithm that combines advanced error feedback (EF21), stochastic gradient descent (SGD), and Polyak momentum, introducing theoretical and practical improvements in communication- and sample-efficient training of machine learning models under contractive-compressor regimes. The method relaxes strong assumptions prevalent in previous error-feedback approaches and achieves optimal communication and sample complexity in nonconvex settings, notably without requiring large batch sizes or bounded gradient dissimilarity conditions (Fatkhullin et al., 2021).
1. Error Feedback Foundations and EF21
Error feedback has emerged as a key approach enabling the convergence of distributed gradient methods under lossy (contractive) communication schemes. The EF21 mechanism, introduced by Richtárik et al. (2021), leverages a Markov compressor induced by a contractive compressor $\mathcal{C}$ with parameter $\alpha \in (0, 1]$, satisfying
$$\mathbb{E}\bigl\|\mathcal{C}(x) - x\bigr\|^2 \le (1-\alpha)\,\|x\|^2 \quad \text{for all } x \in \mathbb{R}^d.$$
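As a concrete instance, a Top-$K$ compressor satisfies the contractive property with $\alpha = K/d$, and does so deterministically. The sketch below (an illustrative NumPy implementation, not code from the paper) builds Top-$K$ and verifies $\|\mathcal{C}(x) - x\|^2 \le (1-\alpha)\|x\|^2$ on random vectors:

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

# Verify the contractive property with alpha = k/d. Top-K is deterministic,
# so the expectation in the general definition is not needed here.
rng = np.random.default_rng(0)
d, k = 100, 10
alpha = k / d
for _ in range(100):
    x = rng.standard_normal(d)
    err = np.sum((top_k(x, k) - x) ** 2)
    assert err <= (1 - alpha) * np.sum(x ** 2) + 1e-12
```

The property holds because discarding the $d - K$ smallest-magnitude coordinates removes at most a $(d-K)/d$ fraction of the vector's energy.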
EF21 mitigates limitations of earlier heuristics such as EF14 (Seide et al., 2014), including pessimistic rates and reliance on bounded-gradient conditions, by delivering convergence in the smooth nonconvex regime with strong theoretical guarantees (Fatkhullin et al., 2021).
2. Algorithmic Structure of EF21-SGDM
EF21-SGDM augments the baseline EF21 error-feedback protocol with stochastic gradients and momentum. Each of the $n$ distributed workers maintains local states $(g_t^i, v_t^i)$, where $g_t^i$ is the error-feedback memory and $v_t^i$ is the momentum buffer. At each iteration $t$:
- The master broadcasts the global parameter $x_t$.
- Each worker computes a stochastic gradient estimate $\hat{\nabla} f_i(x_t)$ (with bounded variance $\sigma^2$) and updates its local momentum buffer: $v_t^i = (1-\eta)\,v_{t-1}^i + \eta\,\hat{\nabla} f_i(x_t)$.
- The error-feedback memory is updated with the compressed difference, which is all that is communicated: $g_{t+1}^i = g_t^i + \mathcal{C}\!\left(v_t^i - g_t^i\right)$.
- The master aggregates $g_{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_{t+1}^i$.
- The global iterate is updated with stepsize $\gamma$: $x_{t+1} = x_t - \gamma\,g_{t+1}$.
The momentum parameter $\eta$ is typically selected adaptively. In practice, Top-$K$ compressors, with $\alpha = K/d$, govern the communication budget (Fatkhullin et al., 2021).
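Putting the steps above together, here is a minimal single-process simulation of EF21-SGDM (a sketch, not the paper's implementation): it assumes quadratic local objectives $f_i(x) = \tfrac12\|x - b_i\|^2$, Gaussian gradient noise, and the momentum/memory update forms described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 4, 20, 4                  # workers, dimension, Top-K budget (alpha = k/d)
gamma, eta, sigma = 0.1, 0.1, 0.01  # stepsize, momentum weight, gradient-noise level
b = rng.standard_normal((n, d))     # local objectives f_i(x) = 0.5 * ||x - b_i||^2
x_star = b.mean(axis=0)             # minimizer of f(x) = (1/n) * sum_i f_i(x)

def top_k(z, k):
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

x = np.zeros(d)
v = np.zeros((n, d))  # momentum buffers v_t^i
g = np.zeros((n, d))  # error-feedback memories g_t^i

for t in range(2000):
    for i in range(n):
        grad = (x - b[i]) + sigma * rng.standard_normal(d)  # stochastic gradient
        v[i] = (1 - eta) * v[i] + eta * grad                # momentum buffer update
        g[i] += top_k(v[i] - g[i], k)                       # EF21 memory update (compressed message)
    x = x - gamma * g.mean(axis=0)                          # master aggregates and steps

print(np.linalg.norm(x - x_star))  # small residual at the noise floor
```

Despite compressing each message to $k = d/5$ coordinates and using batch size 1, the iterate approaches the minimizer, mirroring the batch-free behavior described above.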
3. Convergence Analysis and Complexity Bounds
Under standard smoothness ($L$-Lipschitz gradients) and bounded-variance assumptions, EF21-SGDM achieves the following convergence guarantee in the nonconvex regime:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl\|\nabla f(x_t)\bigr\| \le \mathcal{O}\!\left(\sqrt{\frac{L\,\delta^0}{\alpha T}} + \left(\frac{L\,\delta^0\,\sigma^2}{n\,T}\right)^{1/4}\right),$$
where $\delta^0 = f(x^0) - \inf_x f(x)$. To reach $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\| \le \varepsilon$, the required number of iterations is
$$T = \mathcal{O}\!\left(\frac{L\,\delta^0}{\alpha\,\varepsilon^2} + \frac{L\,\delta^0\,\sigma^2}{n\,\varepsilon^4}\right).$$
This rate matches the optimal communication complexity of gradient descent but, crucially, does not require large mini-batch sizes ($B = 1$ suffices). In deep learning experiments, heavy-ball momentum improves empirical generalization without affecting the asymptotic rate (Fatkhullin et al., 2021).
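For intuition about how the deterministic and stochastic terms trade off, the snippet below evaluates a bound of the assumed form $T \approx L\delta^0/(\alpha\varepsilon^2) + L\delta^0\sigma^2/(n\varepsilon^4)$, with hypothetical constants chosen only for illustration (not values from the paper):

```python
# Hypothetical problem constants, chosen only for illustration.
L_smooth, delta0, sigma2, n, alpha = 1.0, 1.0, 1.0, 8, 0.05

def iterations(eps):
    """Assumed iteration bound, split into its two terms."""
    det = L_smooth * delta0 / (alpha * eps ** 2)         # communication-driven term
    stoch = L_smooth * delta0 * sigma2 / (n * eps ** 4)  # variance-driven term
    return det, stoch

for eps in (1e-1, 1e-2):
    det, stoch = iterations(eps)
    print(eps, det, stoch)
```

With these constants, the $1/(\alpha\varepsilon^2)$ term dominates at loose targets, so compression quality ($\alpha$) matters most; at tight targets the $\sigma^2/(n\varepsilon^4)$ term takes over, and adding workers ($n$) gives linear speedup.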
4. Double-Momentum Variant: EF21-SGD2M
EF21-SGD2M generalizes EF21-SGDM by introducing an additional momentum buffer: alongside $v_t^i$, each worker maintains a second sequence obtained by applying a further momentum (averaging) step to $v_t^i$. Compression and error feedback then proceed from this second buffer rather than from $v_t^i$ directly. The resulting method retains the $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ communication cost and eliminates a suboptimal term appearing in previous error-feedback methods' sample complexity bounds (Fatkhullin et al., 2021).
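One plausible worker-side reading of the double-momentum step is sketched below; the exact placement of the second momentum is an assumption here, and the names `u`, `eta1`, `eta2` are illustrative, not from the paper:

```python
import numpy as np

def worker_update(grad, u, v, g, eta1=0.1, eta2=0.1, compress=lambda z: z):
    """One EF21-SGD2M-style worker step (sketch; exact form is an assumption).

    u: first momentum buffer (as in EF21-SGDM),
    v: second momentum buffer, further smoothing u,
    g: EF21 error-feedback memory; compression acts on v, not on u.
    """
    u = (1 - eta1) * u + eta1 * grad  # first momentum step
    v = (1 - eta2) * v + eta2 * u     # second momentum step
    g = g + compress(v - g)           # EF21 memory update on the second buffer
    return u, v, g
```

With a constant gradient and an identity compressor, all three buffers converge geometrically to that gradient, showing that the extra averaging reshapes the noise profile rather than moving the fixed point.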
5. Practical Guidelines and Parameter Selection
Table: Recommended EF21-SGDM parameter regimes.
| Parameter | Typical Value | Remarks |
|---|---|---|
| Compressor $\mathcal{C}$ | Top-$K$, $\alpha = K/d$ | Governs compression rate |
| Stepsize $\gamma$ | $\mathcal{O}(\alpha/L)$ (theory) | May be enlarged in practice |
| Momentum $1-\eta$ | $0.9$–$0.99$ | Lower in noisy settings |
| Batch size $B$ | $1$ | Large batches not needed |
EF21-SGDM is robust to small batch sizes and noisy communication, and does not rely on extra bounded-gradient or similarity assumptions.
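The table's recommendations can be bundled into a small helper. This is a heuristic sketch: the function name is invented, and the $\gamma \approx \alpha/L$ starting point is an assumption in line with EF21-style stepsize theory.

```python
def ef21_sgdm_defaults(d: int, k: int, L: float, momentum: float = 0.9) -> dict:
    """Heuristic hyperparameter suggestions for EF21-SGDM (sketch).

    d: model dimension, k: Top-K budget, L: smoothness constant.
    """
    alpha = k / d                # contraction parameter of Top-K
    return {
        "alpha": alpha,
        "stepsize": alpha / L,   # theory-driven starting point; often enlarged in practice
        "momentum": momentum,    # use a lower value (e.g. 0.8) in very noisy settings
        "batch_size": 1,         # large batches are not needed
    }

cfg = ef21_sgdm_defaults(d=1_000_000, k=10_000, L=1.0)
print(cfg)
```

For a million-parameter model compressed to ten thousand coordinates, this yields $\alpha = 0.01$ and a conservative stepsize of $0.01/L$ as a tuning anchor.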
6. Quantitative Comparison to Prior Art
Table: Quantitative comparison of error feedback variants under identical smoothness, variance, and Top-K compression (Fatkhullin et al., 2021).
| Method | Comm. (Top-$K$) | Sample (per node) | Batch-free | Extra assumptions |
|---|---|---|---|---|
| EF14-SGD | pessimistic in $\alpha$ | suboptimal | Yes | Needs bounded gradients |
| EF21-SGD | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | requires large batch $B$ | No | No BG/BGS |
| BEER | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | requires large batch $B$ | No | No BG/BGS |
| EF21-SGDM | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon^4}\right)$ | Yes | None |
| EF21-SGD2M | $\mathcal{O}\!\left(\frac{1}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon^4}\right)$ | Yes | None |
EF21-SGDM achieves optimal communication and sample complexities absent the restrictive assumptions and large batch requirements of preceding algorithms.
7. Theoretical Insights and Proof Techniques
The convergence analysis of EF21-SGDM employs a Lyapunov function combining the objective gap, momentum-buffer variance, and error-feedback memory, enabling precise tracking of iterates and error propagation:
$$\Phi_t = f(x_t) - f^{\star} + \frac{c_1}{n}\sum_{i=1}^{n}\bigl\|g_t^i - v_t^i\bigr\|^2 + \frac{c_2}{n}\sum_{i=1}^{n}\bigl\|v_t^i - \nabla f_i(x_t)\bigr\|^2,$$
for suitable constants $c_1, c_2 > 0$ depending on $\gamma$, $\eta$, and $\alpha$.
Key technical recurrences establish control over momentum-variance and compression error, facilitating optimal convergence rates in the distributed nonconvex setting (Fatkhullin et al., 2021).
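The Lyapunov-based argument can be checked numerically on a toy quadratic setting: below, $\Phi_t$ combines the objective gap with the two buffer-mismatch terms (the choice $c_1 = c_2 = 1$ and the exact functional are illustrative assumptions), and it decays along a deterministic EF21-SGDM run.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 4, 10, 2      # workers, dimension, Top-K budget
gamma, eta = 0.05, 0.2  # stepsize and momentum weight
b = rng.standard_normal((n, d))  # f_i(x) = 0.5 * ||x - b_i||^2
x_star = b.mean(axis=0)
f_star = 0.5 * np.mean(np.sum((x_star - b) ** 2, axis=1))

def top_k(z, k):
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def lyapunov(x, v, g, c1=1.0, c2=1.0):
    """Objective gap + error-feedback mismatch + momentum mismatch."""
    f = 0.5 * np.mean(np.sum((x - b) ** 2, axis=1))
    grads = x - b  # exact per-node gradients
    return (f - f_star
            + c1 * np.mean(np.sum((g - v) ** 2, axis=1))
            + c2 * np.mean(np.sum((v - grads) ** 2, axis=1)))

x, v, g = np.zeros(d), np.zeros((n, d)), np.zeros((n, d))
phis = []
for t in range(2000):
    phis.append(lyapunov(x, v, g))
    grads = x - b                   # deterministic run (sigma = 0)
    v = (1 - eta) * v + eta * grads
    for i in range(n):
        g[i] += top_k(v[i] - g[i], k)
    x = x - gamma * g.mean(axis=0)

print(phis[0], phis[-1])  # Phi shrinks along the run
```

All three Lyapunov terms vanish together, which is the numerical counterpart of the recurrences controlling momentum variance and compression error.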
A plausible implication is that these techniques are of independent interest in nonconvex stochastic optimization with momentum, as the analysis remains robust even without incorporating compression. This positions EF21-SGDM and EF21-SGD2M as rigorous foundations for future investigations into communication-efficient distributed learning.