Backward-pass memory analysis for the parallelized blocks
Develop a rigorous memory analysis for the backward pass of the layers analyzed in the paper: Grouped-Query Attention (GQA), feed-forward network (MLP), and Mamba-2 blocks. Extend the paper's forward-pass analysis to quantify per-device memory usage during backpropagation under the considered parallelization strategies (tensor parallelism, context parallelism, and data parallelism).
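As a starting point, such an analysis can be sketched for the simplest of the three blocks, the MLP, under tensor parallelism alone. The sketch below is not from the paper: the precision, the FFN expansion factor `ffn_mult`, the sharding pattern (column-parallel first GEMM, row-parallel second GEMM, as in Megatron-style tensor parallelism), and the choice of which activations are stashed for backward are all assumptions, and the GQA and Mamba-2 cases would need analogous, more involved accounting.

```python
def mlp_backward_mem_per_device(batch, seq, hidden, tp=1, ffn_mult=4, bytes_per=2):
    """Rough per-device backward-pass memory (bytes) for one MLP block
    under tensor parallelism of degree `tp`.

    Assumptions (illustrative, not from the paper):
    - Megatron-style sharding: column-parallel first GEMM, row-parallel second.
    - Saved activations and gradients are stored in `bytes_per` precision.
    - No activation recomputation or optimizer state is counted.
    """
    tokens = batch * seq
    # Activations stashed in forward and consumed by backward: the block
    # input (replicated across TP ranks) plus the intermediate activation
    # (sharded across TP ranks by the column-parallel first GEMM).
    saved_acts = tokens * hidden + tokens * (ffn_mult * hidden) // tp
    # Weight gradients: both GEMM weight shards live on each rank.
    weight_grads = 2 * hidden * (ffn_mult * hidden) // tp
    # Peak transient: gradient w.r.t. the sharded intermediate activation.
    transient = tokens * (ffn_mult * hidden) // tp
    return (saved_acts + weight_grads + transient) * bytes_per
```

Under this model the sharded terms shrink linearly with `tp` while the replicated block input does not, which is the kind of per-device asymmetry the requested analysis would need to make precise for all three block types.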
References
Memory analysis for the backward pass is left for future work.
— Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide
(2602.09109 - Amer et al., 9 Feb 2026) in Section 5, Parallel Strategies Theoretical Analysis (opening paragraph)