Backward-pass memory analysis for the parallelized blocks

Develop a rigorous memory analysis for the backward pass of the layers analyzed in the paper: Grouped-Query Attention (GQA), the feed-forward network (MLP), and Mamba-2 blocks, under the parallelization strategies considered there (tensor parallelism, context parallelism, and data parallelism). Extend the provided forward-pass analysis to quantify per-device memory usage during backpropagation.

Background

Section 5 presents unified derivations of FLOPs, memory, and communication for GQA, MLP, and Mamba, focusing on the forward pass (i.e., the training forward pass and inference prefill). The authors explicitly defer the backward-pass memory analysis to future work. Extending this analysis is necessary to fully characterize per-device memory during backpropagation under the same parallel strategies.
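As a starting point for such an extension, the sketch below estimates the activation memory a single MLP block must retain for its backward pass under tensor parallelism. This is an illustrative accounting only, not taken from the paper: the function name, the stored-tensor inventory, and the parameters (batch `b`, sequence length `s`, hidden size `h`, expansion factor `ffn_mult`, TP degree `tp`, element size `bytes_per_el`) are all assumptions made for this example, and a real analysis would also cover GQA and Mamba-2 and the context- and data-parallel cases.

```python
def mlp_backward_activation_bytes(b, s, h, ffn_mult=4, tp=1, bytes_per_el=2):
    """Rough per-device activation memory (bytes) one MLP block keeps
    for backpropagation under tensor parallelism of degree `tp`.

    Illustrative accounting (assumed, fp16/bf16 by default):
      - block input x:          b*s*h              (replicated across TP ranks)
      - first GEMM output:      b*s*ffn_mult*h/tp  (column-parallel shard)
      - activation-fn output:   b*s*ffn_mult*h/tp  (input to row-parallel GEMM)
    """
    x_in = b * s * h                      # needed for the first GEMM's weight grad
    hidden = b * s * ffn_mult * h // tp   # sharded along the ffn dimension
    act_out = b * s * ffn_mult * h // tp  # input saved for the second GEMM
    return (x_in + hidden + act_out) * bytes_per_el
```

For example, under these assumptions a block with b=1, s=4096, h=4096, tp=8 retains about 64 MiB of activations per device; with tp=1 the same block retains about 288 MiB, showing how the sharded terms dominate and shrink with the TP degree.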

References

Memory analysis for the backward pass is left for future work.

Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide (2602.09109 - Amer et al., 9 Feb 2026), Section 5, Parallel Strategies Theoretical Analysis (opening paragraph)