
ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Published 1 Jul 2025 in cs.LG | (2507.01004v2)

Abstract: Linear attention mechanisms deliver significant advantages for LLMs by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.

Summary

  • The paper’s main contribution is ZeCO’s innovative All-Scan primitive that nearly eliminates communication overhead in linear attention models.
  • It strategically overlaps communication with computation, delivering a 60% speedup on 256 GPUs at an 8M sequence length and improving scalability.
  • Experimental results show up to 3.9 times faster communication and stable throughput scaling compared to conventional Sequence Parallelism methods.


Introduction

The paper "ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention" addresses the inefficiencies in existing Sequence Parallelism (SP) methods for linear attention models, particularly focusing on communication overhead as the primary bottleneck. ZeCO introduces an innovative SP method designed to achieve near-linear scalability in long-sequence training by eliminating communication overhead. The core contribution, All-Scan, strategically minimizes communication latency to enable efficient parallelism with minimal additional computational cost.

Linear Attention and Sequence Parallelism

Linear attention techniques offer computational efficiency by reducing the complexity from O(L^2) to O(L d^2), making them suitable for processing ultra-long sequences. These models replace the softmax attention mechanism with operations that are linear in sequence length. The core challenge remains distributing these workloads efficiently across devices due to the burdensome communication requirements of existing SP strategies. Standard SP algorithms either face serial execution dependencies or incur substantial communication delays, inhibiting scalable throughput.
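To make the complexity claim concrete, here is a minimal, illustrative sketch of linear attention as a recurrent state update (the function name and the plain cumulative-sum formulation are assumptions for exposition, not the paper's exact kernel). Carrying a d x d state makes each token's cost O(d^2), so a length-L sequence costs O(L d^2) rather than the O(L^2 d) of softmax attention.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Q, K, V: (L, d) arrays. Returns outputs (L, d) and the final state S."""
    L, d = Q.shape
    S = np.zeros((d, d))                  # running sum of k_t v_t^T
    out = np.empty((L, d))
    for t in range(L):
        S = S + np.outer(K[t], V[t])      # O(d^2) state update per token
        out[t] = Q[t] @ S                 # O(d^2) readout per token
    return out, S

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out, S = linear_attention(Q, K, V)
```

The final state S is exactly what a sequence-parallel scheme must hand off between chunks: the next chunk can resume from S without revisiting earlier tokens.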

ZeCO Methodology

ZeCO addresses these limitations through multiple innovations:

  1. All-Scan Communication Primitive: This new primitive significantly reduces the communication footprint needed for SP. All-Scan ensures that each sequence parallelism rank gets only the necessary initial operator state, effectively eliminating any superfluous communication overhead.

    Figure 1: Illustration of ZeCO, demonstrating its strengths in parallel scalability, overlap of computation and communication, and reduced inter-device synchronization.

  2. Parallelism and Optimal Overlap: By overlapping communication with computation, such as the concurrent computation of diagonal attention scores, ZeCO optimizes resource utilization. This approach transforms communication time from a bottleneck into a parallel process that minimizes inactive periods across devices.
  3. Computational and I/O Efficiency: ZeCO restructures the existing Gated Linear Attention (GLA) algorithm to minimize additional computational and I/O costs, maintaining efficient sequence parallel training.
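The core idea behind All-Scan can be sketched, in a simplified and non-distributed form, as an exclusive prefix scan over per-rank operator states: rank r needs the combined state of ranks 0..r-1 as its initial state. The function below is an assumption-laden illustration (plain summation as the combine operator, no gating or decay, no actual inter-device communication), not the paper's implementation.

```python
import numpy as np

def all_scan(local_states):
    """Exclusive prefix scan over per-rank d x d states (simple sum combine).

    local_states[r] is the state contribution computed locally by rank r;
    the returned list gives each rank the initial state it should start from.
    """
    init_states = []
    acc = np.zeros_like(local_states[0])
    for S in local_states:
        init_states.append(acc.copy())    # what rank r resumes from
        acc = acc + S                     # fold in rank r's contribution
    return init_states

# Four simulated ranks, each holding a constant 2x2 state for clarity.
ranks = [np.full((2, 2), float(r + 1)) for r in range(4)]
inits = all_scan(ranks)
# rank 0 starts from zeros; rank 3 starts from the sum of ranks 0, 1, and 2
```

Because each rank only ever needs one combined predecessor state, the per-rank communication volume is a single state regardless of how many devices participate, which is the property the paper exploits.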

Experimental Results and Analysis

The performance gains of ZeCO are substantiated through extensive experimentation:

  • Communication Efficiency: Compared to state-of-the-art SP methods, ZeCO achieves up to 3.9 times faster communication. For instance, on 256 GPUs with an 8M sequence length, ZeCO delivers a 60% speedup over the current leading SP method (Figure 2).

    Figure 2: ZeCO has the lowest communication time while maintaining the lowest communication volume.

  • Scalability: ZeCO demonstrates near-linear scalability, scaling efficiently from 8 to 256 devices without significant performance degradation. Throughput grows nearly linearly as GPUs are added, unlike other SP methods that suffer from scaling inefficiencies (Figure 3).

    Figure 3: ZeCO exhibits stable throughput scaling with performance comparable to data parallelism.

  • Theoretical Optimality: ZeCO's design is backed by theoretical analysis confirming the minimal communication and computational costs necessary for SP. It achieves zero communication overhead: per-rank communication volume does not grow with the number of devices, unlike existing methods such as LASP-2, whose communication burden increases linearly with device count.
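The scaling contrast above can be made concrete with a back-of-the-envelope cost model (the byte counts and the gather-style baseline are illustrative assumptions, not measurements from the paper): a scheme in which every rank gathers every other rank's state moves a volume that grows with device count P, while a scan-style handoff moves one predecessor state per rank regardless of P.

```python
def allgather_volume(P, state_bytes):
    """Per-rank receive volume if every rank gathers all other ranks' states."""
    return (P - 1) * state_bytes      # grows linearly with device count P

def scan_volume(P, state_bytes):
    """Per-rank receive volume for a scan-style handoff: one predecessor state."""
    return state_bytes                # independent of device count P

state = 64 * 1024 * 1024              # e.g. a hypothetical 64 MiB operator state
for P in (8, 64, 256):
    ratio = allgather_volume(P, state) / scan_volume(P, state)
    print(f"P={P:3d}: gather/scan volume ratio = {ratio:.0f}x")
```

At P = 256 the gather-style volume is 255 times the scan-style volume, which is why a constant per-rank handoff is the property worth proving optimal.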

Conclusion

ZeCO sets a new benchmark for sequence parallelism in linear attention models by providing a method that nearly eliminates communication overhead. This efficiency unlocks the potential of linear attention mechanisms in large-scale training scenarios, particularly for ultra-long sequences in LLMs. Future research directions could explore optimizing All-Scan's implementation further or extending ZeCO's approaches to other forms of linear and non-linear attention mechanisms. The fundamental innovations outlined in ZeCO pave the way for more efficient distributed learning frameworks.
