Ring Attention: Blockwise Transformers
- The paper introduces a ring-based attention mechanism that partitions sequences across devices, enabling exact self-attention without approximations.
- It uses blockwise computation with a bidirectional TokenRing communication pattern to overlap compute and data transfers, thereby reducing latency.
- Empirical results show that this method scales context length by orders of magnitude while maintaining high hardware utilization and outperforming traditional attention models.
Ring Attention with Blockwise Transformers enables exact Transformer self-attention and feedforward on sequences that are orders of magnitude longer than the memory limits of conventional architectures, without resorting to approximations or sacrificing compute/communication efficiency. The core principle is partitioning the sequence dimension across devices so that each processes local query blocks and rotates key-value blocks (or queries) in a ring topology. This achieves scalable context length, minimal activation memory, and high hardware utilization. Recent innovations, such as TokenRing, further address the communication bottlenecks inherent in earlier ring attention approaches by employing bidirectional, fully overlapping peer-to-peer data transfers, realizing substantial improvements in throughput and latency (Wang et al., 2024, Liu et al., 2023).
1. Blockwise Attention Formulation and Partitioning
Let a Transformer layer receive inputs $Q, K, V \in \mathbb{R}^{H \times S \times D}$, where $H$ is the number of attention heads, $S$ is the sequence length, and $D$ is the per-head hidden dimension. The sequence is partitioned into $N$ contiguous blocks of length $B = S/N$, each assigned to a rank (device) $i \in \{0, \dots, N-1\}$:
- $Q = [Q_0, Q_1, \dots, Q_{N-1}]$, with $Q_i \in \mathbb{R}^{H \times B \times D}$.
- Similarly for $K$ and $V$.
Each rank retains $Q_i$, $K_i$, and $V_i$ locally. Iteratively, $(K, V)$ blocks circulate through the logical ring so that, at each step $t$, rank $i$ receives $(K_j, V_j)$ with $j = (i - t) \bmod N$.
Blockwise attention is computed as $O_{ij} = \mathrm{softmax}\!\left(Q_i K_j^\top / \sqrt{D}\right) V_j$, yielding for each block pair a partial output $O_{ij}$ and a per-row log-sum-exp statistic $\mathrm{lse}_{ij}$.
Partial results are merged in a numerically stable log-sum-exp manner into global accumulators $O$ and $\mathrm{lse}$, using the following update for each head and token:

$$\mathrm{lse}' = \log\!\left(e^{\mathrm{lse}} + e^{\mathrm{lse}_{ij}}\right), \qquad O' = e^{\mathrm{lse} - \mathrm{lse}'}\, O + e^{\mathrm{lse}_{ij} - \mathrm{lse}'}\, O_{ij}$$
After $N$ steps, each GPU owns the final outputs $O_i$ for a unique sequence segment (Wang et al., 2024).
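The merge rule above can be checked numerically. The following sketch (NumPy, with illustrative shapes) splits the keys into two blocks, computes blockwise softmax attention with per-row log-sum-exp statistics, and verifies that the merged result equals attention over all keys at once:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8                        # tokens per query block, head dim (illustrative)
q = rng.normal(size=(B, D))
k = rng.normal(size=(2 * B, D))    # two key blocks' worth of keys
v = rng.normal(size=(2 * B, D))

def block_attn(q, k, v):
    """Blockwise attention: softmax-weighted values plus per-row log-sum-exp."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    lse = np.log(np.exp(s).sum(-1, keepdims=True))   # unguarded exp is fine at demo scale
    o = np.exp(s - lse) @ v
    return o, lse

def lse_merge(o_a, lse_a, o_b, lse_b):
    """Numerically stable merge of two partial attention results."""
    lse = np.logaddexp(lse_a, lse_b)                 # log(e^a + e^b), stable
    return np.exp(lse_a - lse) * o_a + np.exp(lse_b - lse) * o_b, lse

# Blockwise over the two halves, then merge.
o1, l1 = block_attn(q, k[:B], v[:B])
o2, l2 = block_attn(q, k[B:], v[B:])
o_merged, _ = lse_merge(o1, l1, o2, l2)

# Reference: attention over all keys at once.
o_full, _ = block_attn(q, k, v)
assert np.allclose(o_merged, o_full)
```

Because the merge is associative, the same update can absorb one ring partner's block per step, in any arrival order.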
2. Peer-to-Peer (P2P) Ring Communication Patterns
The underlying communication topology is a logical ring where each rank communicates only with its neighbors. In baseline Ring Attention (Liu et al., 2023), key-value ($K$, $V$) blocks are rotated clockwise; at each step, a rank computes attention for its fixed $Q$ block against the received ($K$, $V$) pair and sends its local $(K, V)$ block to the next neighbor.
TokenRing introduces bidirectional communication to achieve near-perfect overlap between compute and communication:
- Forward ring: $Q$ blocks are sent clockwise.
- Backward ring: block outputs and $\mathrm{lse}$ statistics are sent counter-clockwise.
This bidirectional strategy exploits full-duplex links typical in OAM full-mesh, NVLink mesh, NVSwitch, or Huawei Ascend HCCS, enabling both directions to transmit simultaneously, thereby halving the effective per-step communication latency.
Example Pseudocode Structure
A simplified iteration (schematic):
```python
for i in range(N):
    # Forward: send Q to right neighbor, receive Q from left
    # Backward: send previous block_out, block_lse to left, receive from right
    # Compute local blockwise attention
    # Merge results using LogSumExpUpdate
    # Synchronize before next iteration
```
This concurrent messaging ensures no rank sits idle, and computation on the next block can begin as soon as data is available (Wang et al., 2024).
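The rotation schedule itself can be simulated in a few lines. The single-process sketch below (communication primitives elided; the shift direction is an assumption matching "send to right, receive from left") checks that after $N$ steps every query block has been paired with every rank's key-value block exactly once:

```python
# Simulated TokenRing schedule: query blocks rotate clockwise while each
# rank's K/V blocks stay fixed.
N = 4                                    # number of ranks in the ring
pairs = set()                            # (q_block, kv_rank) pairs computed
q_at = list(range(N))                    # q_at[r] = which Q block rank r holds
for step in range(N):
    for rank in range(N):
        pairs.add((q_at[rank], rank))    # compute attn(Q block, local K/V)
    # Send Q to the right neighbor, receive from the left.
    q_at = [q_at[(r - 1) % N] for r in range(N)]

# Every query block met every rank's K/V exactly once.
assert pairs == {(q, kv) for q in range(N) for kv in range(N)}
```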
3. Asymptotic Complexity and Communication Overhead
Let $N$ = number of GPUs, $B$ = per-GPU block length, $D$ = head dimension, $H$ = number of heads.
- Compute per step per GPU: $O(H B^2 D)$ (blockwise Flash Attention)
- Communication per step:
  - TokenRing: send the $Q$ block forward ($H B D$ elements), plus the block output and $\mathrm{lse}$ statistics backward ($H B D + H B$ elements).
  - Baseline Ring Attention: transmit the $K$ and $V$ blocks ($2 H B D$ elements) unidirectionally.
- Per-step latency, for per-direction link bandwidth $W$:
  - TokenRing: $\approx \max(H B D,\; H B D + H B)/W$, since the forward and backward transfers overlap on full-duplex links.
  - Baseline: $\approx 2 H B D / W$, all over a single direction.
Because computation scales as $O(B^2)$ but communication as $O(B)$, increasing $N$ (and thus shrinking $B$ at fixed sequence length) eventually makes the system communication-bound. TokenRing's bidirectional overlap delays this scaling bottleneck by a factor of 2 (Wang et al., 2024).
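A back-of-envelope calculation makes the crossover concrete. The throughput and bandwidth figures below are illustrative assumptions, not measurements from either paper:

```python
# Per-step compute scales with B^2 while per-step communication scales with B,
# so shrinking the block (adding GPUs at fixed sequence length) eventually
# makes each step communication-bound.
H, D = 32, 128                  # heads, head dim (LLaMA2-7B-like, assumed)
flops_rate = 150e12             # sustained bf16 FLOP/s per GPU (assumed)
link_bw = 300e9                 # bytes/s per link direction (assumed)

def step_times(B):
    compute = 4 * H * B * B * D / flops_rate   # QK^T + PV matmuls, ~2 FLOPs/MAC each
    comm = 2 * H * B * D * 2 / link_bw         # K and V blocks, 2-byte elements
    return compute, comm

for B in (256, 1024, 4096):
    c, m = step_times(B)
    bound = "comm-bound" if m > c else "compute-bound"
    print(f"B={B:5d}  compute={c*1e3:7.3f} ms  comm={m*1e3:7.3f} ms  {bound}")
```

Under these assumed rates the crossover sits at $B = \text{flops\_rate}/\text{link\_bw} \cdot \text{(bytes per element)} / \text{(FLOPs per element pair)}$, i.e. a few hundred tokens per block; real systems shift it with kernel efficiency and overlap quality.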
Activation Memory Scaling
| Scheme | Per-device Memory | Bottleneck |
|---|---|---|
| Vanilla | $O(b s^2)$ (materialized attention matrix) | Quadratic in $s$ |
| BPT | $O(b s h)$ | Linear in $s$ |
| Ring Attention | $6 b c h$ per device | Linear in block size $c$, independent of $s$ |

Here $b$ is the batch size, $s$ the full sequence length, $h$ the hidden size, and $c$ the per-device block length.
This enables effective "infinite context" with memory usage independent of the full sequence length (Liu et al., 2023).
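Plugging representative numbers into the $6bch$ figure above illustrates the point; the hidden size and 2-byte dtype below are assumptions for illustration:

```python
# Per-device activation memory for Ring Attention depends only on the block
# size c, not on the full sequence length s.
b = 1                # batch size
h = 4096             # hidden size (LLaMA2-7B-like, assumed)
bytes_per = 2        # bf16 elements (assumed)

def ring_activation_bytes(c):
    """6*b*c*h activation elements per device, as quoted in the table above."""
    return 6 * b * c * h * bytes_per

for c in (4_096, 16_384):
    gib = ring_activation_bytes(c) / 2**30
    print(f"block {c:>6} tokens -> {gib:.2f} GiB/device")
```

The same footprint serves a 256K-token or a 4M-token sequence: extra length is absorbed by adding devices to the ring, not by growing per-device activations.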
4. Empirical Results and Performance Benchmarks
On 4×NVIDIA A10 GPUs running LLaMA2-7B at long context, TokenRing achieves:
| Method | Comm/Step (ms) | Compute/Step (ms) | Speedup |
|---|---|---|---|
| RingAttention (PXB) | 7.6 | 2.5 | 1× |
| TokenRing (PXB) | 3.5–4.6 | — | 1.7× |
| TokenRing (PIX) | 3.2–4.0 | — | 2.0× |
Maximum context lengths, measured on various accelerators and models:
| Accelerator | Model | Vanilla | MemEff Attn | BPT | Ring Attn |
|---|---|---|---|---|---|
| 8×A100 NVLink | 7B | 2K | 16K | 32K | 256K |
| 32×A100 InfiniBand | 13B | 4K | 32K | 64K | 2048K |
| TPUv4-1024 | 65B | 4K | 8K | 16K | 4096K |
Model FLOPs utilization degrades by less than 5 percentage points compared to blockwise partitioning at standard context lengths, and remains high in large-scale LLMs at very long contexts (Liu et al., 2023).
On line retrieval tasks (LLaMA-13B, 512K context) Ring Attention outperforms GPT3.5-16K and Claude-100K by absolute margins up to 50 percentage points at maximal context, indicating that exact attention is retained even at extreme lengths.
5. Load Balancing and Topological Adaptability
Uniform sequence partitioning and zigzag ordering guarantee identical workload per rank and synchronized operation across the ring, avoiding load imbalance. For causal LLMs, the zigzag partition skips query-key block pairs with no causal dependency, reducing computation and communication further.
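A minimal sketch of the zigzag idea (the pairing rule below is one common formulation, offered as an assumption rather than the exact scheme of either paper): split the sequence into $2N$ chunks and give rank $i$ chunks $i$ and $2N-1-i$, so that cheap early chunks pair with expensive late ones under causal masking.

```python
# Zigzag assignment for causal attention across N ranks.
N = 4                                        # ranks (illustrative)
assign = {r: (r, 2 * N - 1 - r) for r in range(N)}

def causal_work(chunk):
    # Keys visible to queries in a chunk grow linearly with its position,
    # so later chunks are proportionally more expensive.
    return chunk + 1

loads = [sum(causal_work(c) for c in assign[r]) for r in range(N)]
assert len(set(loads)) == 1                  # every rank gets identical workload
```

Each rank's load is $(r + 1) + (2N - r) = 2N + 1$, a constant independent of $r$, which is exactly the balance property the text describes.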
TokenRing requires only a full-duplex, non-blocking P2P primitive, so commodity interconnects (NVSwitch, OAM mesh, Huawei HCCS) suffice. On heterogeneous or partially connected systems, hybrid schemes (TokenRing intra-node + All-to-All or higher-level RingAttention inter-node) preserve algorithmic efficiency (Wang et al., 2024).
6. Context, Generalization, and Practical Implications
Ring Attention and TokenRing extend blockwise Transformers into the regime of "near-infinite context," enabling context lengths that scale linearly with the number of participating devices, with exact self-attention semantics, numerically stable accumulation, and minimal per-device activation memory.
Unlike prior memory-efficient Transformers (linear, local, or approximate attention), this approach retains full attention expressiveness, allows training and inference with millions of tokens, and is broadly compatible with distributed training and inference frameworks (e.g., xDIT, FSDP pipelines). The only prerequisite is sufficient collective bandwidth to maintain compute-communication overlap for the target block size.
TokenRing's bidirectional, fine-grained ring communication pattern generalizes efficiently to clusters with arbitrary GPU counts and interconnects, sustaining high utilization across a diverse set of cloud and on-premise topologies. This design notably achieves substantial reductions in per-step network latency, high throughput, and enables training and inference at contexts unattainable by previous approaches (Wang et al., 2024, Liu et al., 2023).