
Ring Attention: Blockwise Transformers

Updated 17 February 2026
  • The paper introduces a ring-based attention mechanism that partitions sequences across devices, enabling exact self-attention without approximations.
  • It uses blockwise computation with a bidirectional TokenRing communication pattern to overlap compute and data transfers, thereby reducing latency.
  • Empirical results show that the method scales context length by orders of magnitude beyond conventional attention implementations while maintaining high hardware utilization.

Ring Attention with Blockwise Transformers enables exact Transformer self-attention and feedforward on sequences that are orders of magnitude longer than the memory limits of conventional architectures, without resorting to approximations or sacrificing compute/communication efficiency. The core principle is partitioning the sequence dimension across devices so that each processes local query blocks and rotates key-value blocks (or queries) in a ring topology. This achieves scalable context length, minimal activation memory, and high hardware utilization. Recent innovations, such as TokenRing, further address the communication bottlenecks inherent in earlier ring attention approaches by employing bidirectional, fully overlapping peer-to-peer data transfers, realizing substantial improvements in throughput and latency (Wang et al., 2024, Liu et al., 2023).

1. Blockwise Attention Formulation and Partitioning

Let a Transformer layer receive inputs $Q, K, V \in \mathbb{R}^{H \times L \times D}$, where $H$ is the number of attention heads, $L$ is the sequence length, and $D$ is the per-head hidden dimension. The sequence is partitioned into $N$ contiguous blocks of length $L_{\rm block} = L/N$, each assigned to a rank (device) $j = 0, \ldots, N-1$:

  • $Q = [Q^{(0)}, Q^{(1)}, \ldots, Q^{(N-1)}]$, with $Q^{(j)} \in \mathbb{R}^{H \times L_{\rm block} \times D}$.
  • Similarly for $K$ and $V$.

Each rank $j$ retains $K^{(j)}$ and $V^{(j)}$ locally. Iteratively, $Q$ blocks circulate through the logical ring so that, at each step $i$, rank $j$ receives $Q_{\rm in} = Q^{((j-i) \bmod N)}$.
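As a concrete illustration of the partitioning and the rotation index, the short sketch below (PyTorch, with illustrative sizes that are not from the papers) splits a tensor along the sequence dimension and prints which $Q$ block each rank holds at each ring step.

```python
import torch

def partition_sequence(x: torch.Tensor, num_ranks: int):
    """Split a [H, L, D] tensor into num_ranks contiguous blocks along the sequence dim."""
    H, L, D = x.shape
    assert L % num_ranks == 0, "sequence length must divide evenly across the ring"
    return list(x.chunk(num_ranks, dim=1))           # block j is assigned to rank j

# Illustrative sizes only (not the papers' configurations).
H, L, D, N = 8, 1024, 64, 4
Q = torch.randn(H, L, D)
q_blocks = partition_sequence(Q, N)                  # each q_blocks[j] has shape [H, L/N, D]

# At ring step i, rank j holds Q^{((j - i) mod N)} while K^{(j)}, V^{(j)} stay resident.
for i in range(N):
    held = {j: (j - i) % N for j in range(N)}
    print(f"step {i}: rank -> Q block held: {held}")
```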

Blockwise attention is computed as $\mathrm{FlashAttn}(Q_{\rm in}, K^{(j)}, V^{(j)})$, yielding:

  • $\text{block\_out}^{(i,j)} \in \mathbb{R}^{H \times L_{\rm block} \times D}$
  • $\text{block\_lse}^{(i,j)} \in \mathbb{R}^{H \times L_{\rm block}}$

Partial results are merged in a numerically stable log-sum-exp manner into global accumulators $\mathrm{Out}^{(i,j)}$ and $\mathrm{Lse}^{(i,j)}$, using the following update for head $h$ and token $t$:

$$
\begin{aligned}
\Delta &= \max\bigl(\mathrm{Lse}^{(i-1,j)}_{h,t},\ \text{block\_lse}^{(i,j)}_{h,t}\bigr) \\
\mathrm{Lse}^{(i,j)}_{h,t} &= \Delta + \log\bigl(\exp(\mathrm{Lse}^{(i-1,j)}_{h,t}-\Delta) + \exp(\text{block\_lse}^{(i,j)}_{h,t}-\Delta)\bigr) \\
\mathrm{Out}^{(i,j)}_{h,t,d} &= \exp\bigl(\mathrm{Lse}^{(i-1,j)}_{h,t} - \mathrm{Lse}^{(i,j)}_{h,t}\bigr)\,\mathrm{Out}^{(i-1,j)}_{h,t,d} \\
&\quad + \exp\bigl(\text{block\_lse}^{(i,j)}_{h,t} - \mathrm{Lse}^{(i,j)}_{h,t}\bigr)\,\text{block\_out}^{(i,j)}_{h,t,d}
\end{aligned}
$$

After $N$ steps, each GPU owns the final outputs for a unique sequence segment (Wang et al., 2024).
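Because the merge is exact, the blockwise computation can be checked against full attention on a single process. The sketch below is a minimal NumPy verification with made-up shapes; it is not the papers' implementation, and for clarity it drops the head dimension.

```python
import numpy as np

def block_attention(q, k, v):
    """Softmax attention for one block pair; returns output and per-query log-sum-exp."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # [Lq, Lk]
    lse = np.log(np.exp(scores).sum(-1))               # [Lq]  (fine for small demo values)
    out = np.exp(scores - lse[:, None]) @ v            # [Lq, D]
    return out, lse

def merge(out, lse, block_out, block_lse):
    """Numerically stable accumulation of a new partial result (the update above)."""
    if out is None:
        return block_out, block_lse
    delta = np.maximum(lse, block_lse)
    new_lse = delta + np.log(np.exp(lse - delta) + np.exp(block_lse - delta))
    new_out = (np.exp(lse - new_lse)[:, None] * out
               + np.exp(block_lse - new_lse)[:, None] * block_out)
    return new_out, new_lse

rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
q, k, v = rng.normal(size=(3, L, D))

# "Ring" simulated locally: one query block attends to every K/V block in turn.
q_blk = q[:L // N]                                      # the block owned by rank 0
out, lse = None, None
for j in range(N):
    k_blk, v_blk = k[j * L // N:(j + 1) * L // N], v[j * L // N:(j + 1) * L // N]
    b_out, b_lse = block_attention(q_blk, k_blk, v_blk)
    out, lse = merge(out, lse, b_out, b_lse)

ref, _ = block_attention(q_blk, k, v)                   # exact attention over the full sequence
assert np.allclose(out, ref, atol=1e-6), "blockwise merge should reproduce exact attention"
```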

2. Peer-to-Peer (P2P) Ring Communication Patterns

The underlying communication topology is a logical ring where each rank communicates only with its neighbors. In baseline Ring Attention (Liu et al., 2023), key-value ($K$, $V$) blocks are rotated clockwise; at each step, a rank computes attention for its fixed $Q$ block against the received ($K$, $V$) and sends its local $K$, $V$ block on to the next neighbor.

TokenRing introduces bidirectional communication to achieve near-perfect overlap between compute and communication:

  • Forward ring: $Q$ blocks are sent clockwise.
  • Backward ring: $\text{block\_out}$ and $\text{block\_lse}$ results are sent counter-clockwise.

This bidirectional strategy exploits full-duplex links typical in OAM full-mesh, NVLink mesh, NVSwitch, or Huawei Ascend HCCS, enabling both directions to transmit simultaneously, thereby halving the effective per-step communication latency.

Example Pseudocode Structure

A simplified iteration (schematic):

for i in range(N):
    # Forward ring: post an async send of the current Q block to the right neighbor
    # and an async receive of the next Q block from the left neighbor.
    # Backward ring: send the previous step's block_out / block_lse to the left,
    # receive the partials arriving from the right.
    # Compute local blockwise attention: FlashAttn(Q_in, K_local, V_local).
    # Merge into the (Out, Lse) accumulators with the log-sum-exp update from Sec. 1.
    # Wait on both transfers (synchronize) before the next iteration.

This concurrent messaging ensures no rank sits idle, and computation on the next block can begin as soon as data is available (Wang et al., 2024).
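For readers who want something closer to runnable code, the following sketch uses torch.distributed point-to-point operations. For simplicity it implements the baseline rotation (each rank keeps its $Q$ block and circulates $K$/$V$); TokenRing would additionally post the backward results ring in the same batched call so both directions of the full-duplex link stay busy. Function and variable names here are illustrative, not taken from either paper's codebase.

```python
import math
import torch
import torch.distributed as dist

def block_attn(q, k, v):
    # Naive blockwise attention standing in for a FlashAttention kernel.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v, torch.logsumexp(scores, dim=-1)

def merge_lse(out, lse, b_out, b_lse):
    # Log-sum-exp accumulation of a new partial result (update from Sec. 1).
    if out is None:
        return b_out, b_lse
    new_lse = torch.logaddexp(lse, b_lse)
    return (torch.exp(lse - new_lse).unsqueeze(-1) * out
            + torch.exp(b_lse - new_lse).unsqueeze(-1) * b_out), new_lse

def ring_attention_forward(q_local, k_local, v_local):
    """Each rank keeps its Q block, circulates K/V around the ring, and merges
    the partial results; transfers are posted before compute so they overlap."""
    rank, world = dist.get_rank(), dist.get_world_size()
    right, left = (rank + 1) % world, (rank - 1) % world
    k_cur, v_cur, out, lse = k_local, v_local, None, None
    for _ in range(world):
        k_next, v_next = torch.empty_like(k_cur), torch.empty_like(v_cur)
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, k_cur, right), dist.P2POp(dist.irecv, k_next, left),
            dist.P2POp(dist.isend, v_cur, right), dist.P2POp(dist.irecv, v_next, left),
        ])
        b_out, b_lse = block_attn(q_local, k_cur, v_cur)   # compute while transfers are in flight
        out, lse = merge_lse(out, lse, b_out, b_lse)
        for r in reqs:
            r.wait()
        k_cur, v_cur = k_next, v_next
    return out, lse
```

Posting all sends and receives in one batched call before launching the local attention kernel is what allows the next block's transfer to overlap with the current block's compute.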

3. Asymptotic Complexity and Communication Overhead

Let $N$ = number of GPUs, $B = L/N$ = per-GPU block length, $D$ = per-head dimension, $H$ = number of heads.

  • Compute per step per GPU: $\mathcal{O}(H D B^2)$ (blockwise FlashAttention)
  • Communication per step:
    • TokenRing: send the $Q$ block forward ($H B D$ elements) and $\text{block\_out}$ + $\text{block\_lse}$ backward ($H(BD + B) \approx H B D$ elements).
    • Baseline Ring Attention: transmit $2 H B D$ elements ($K$, $V$) unidirectionally.
  • Per-step latency:
    • TokenRing: $T_{\rm comm}^{\rm (TR)} \approx H B D / \mathrm{BW}_{\rm bidi}$
    • Baseline: $T_{\rm comm}^{\rm (ring)} \approx 2 H B D / \mathrm{BW}_{\rm uni}$

Because computation scales as $O(B^2)$ but communication only as $O(B)$, increasing $N$ at fixed $L$ shrinks $B$ and eventually makes the system communication-bound. TokenRing's bidirectional overlap delays this scaling bottleneck by a factor of 2 (Wang et al., 2024).
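To make the trend concrete, the short script below compares how per-step compute and traffic shrink as the block length $B$ falls. The bandwidth and FLOP figures are assumptions for illustration, not measurements from either paper.

```python
# Illustrative, assumed figures: H=32 heads, D=128 head dim, bf16 (2 bytes/elem),
# 400 GB/s per-direction link bandwidth, 300 TFLOP/s sustained attention compute.
H, D, BYTES = 32, 128, 2
BW, FLOPS = 400e9, 300e12

for B in (64_000, 16_000, 4_000):                 # per-GPU block length shrinks as N grows
    compute_s = 4 * H * D * B * B / FLOPS         # ~4*H*D*B^2 FLOPs for QK^T and PV
    tokenring_s = H * B * D * BYTES / BW          # ~H*B*D elements each way, full duplex
    baseline_s = 2 * H * B * D * BYTES / BW       # K and V, one direction only
    print(f"B={B:>6}: compute {compute_s*1e3:7.1f} ms | "
          f"TokenRing comm {tokenring_s*1e3:5.2f} ms | baseline comm {baseline_s*1e3:5.2f} ms")
```

Compute falls quadratically with $B$ while traffic falls only linearly, which is why sufficiently large rings become communication-bound.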

Activation Memory Scaling

| Scheme | Per-device Memory | Bottleneck |
|---|---|---|
| Vanilla | $O(b h s^2)$ | Quadratic in $s$ |
| BPT | $O(b s h)$ | Linear in $s$ |
| Ring Attention | $6 b c h$ | Linear in $c$ |

This enables effective "infinite context": per-device memory is independent of the full sequence length $s = Nc$, where $b$ is the batch size, $h$ the hidden dimension, $c$ the per-device block length, and $N$ the number of devices (Liu et al., 2023).
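As a worked example with illustrative values (and assuming the $6bch$ figure counts 16-bit scalars): for $b = 1$, hidden size $h = 4096$, and block length $c = 4096$, the per-device activation footprint is $6bch = 6 \cdot 1 \cdot 4096 \cdot 4096 \approx 1.0 \times 10^8$ values, roughly 200 MB, whether the full sequence $s = Nc$ spans 32K tokens on 8 devices or 4M tokens on 1024 devices; only $c$ enters the bound.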

4. Empirical Results and Performance Benchmarks

On 4×NVIDIA A10, running LLaMA2-7B with $H = 32$, $D = 128$, and $L = 24$K tokens, TokenRing achieves:

| Method | Comm/Step (ms) | Compute/Step (ms) | Speedup |
|---|---|---|---|
| RingAttention (PXB) | 7.6 | 2.5 | 1× (baseline) |
| TokenRing (PXB) | 3.5–4.6 | – | 1.7× |
| TokenRing (PIX) | 3.2–4.0 | – | 2.0× |

Maximum context lengths, measured on various accelerators and models:

| Accelerator | Model | Vanilla | MemEff Attn | BPT | Ring Attn |
|---|---|---|---|---|---|
| 8×A100 NVLink | 7B | 2K | 16K | 32K | 256K |
| 32×A100 InfiniBand | 13B | 4K | 32K | 64K | 2048K |
| TPUv4-1024 | 65B | 4K | 8K | 16K | 4096K |

Model FLOPs utilization degrades by less than 5 percentage points relative to blockwise (BPT) training at standard context lengths, and remains above 70% in large-scale LLMs at context lengths beyond 256K tokens (Liu et al., 2023).

On line-retrieval tasks (LLaMA-13B, 512K context), Ring Attention outperforms GPT-3.5-16K and Claude-100K by absolute margins of up to 50 percentage points at maximal context, indicating that exact attention is retained even at extreme lengths.

5. Load Balancing and Topological Adaptability

Uniform sequence partitioning and zigzag ordering give each rank an identical workload and keep operation synchronized across the ring, avoiding load imbalance. For causal LLMs, the zigzag partition skips query–key block pairs that are fully masked by causality, further reducing compute and communication.
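One common zigzag assignment (as used in open-source ring-attention implementations; the exact scheme in the cited papers may differ) cuts the sequence into $2N$ chunks and pairs an early chunk with a late one on each rank, so every rank sees a balanced mix of cheap and expensive causal positions. A toy sketch:

```python
def zigzag_assignment(num_ranks: int):
    """Pair an 'early' and a 'late' chunk on each rank so causal workloads balance.
    The sequence is cut into 2*N chunks; rank j gets chunks j and 2N-1-j."""
    return {j: (j, 2 * num_ranks - 1 - j) for j in range(num_ranks)}

print(zigzag_assignment(4))   # {0: (0, 7), 1: (1, 6), 2: (2, 5), 3: (3, 4)}
```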

TokenRing requires only a full-duplex, non-blocking P2P primitive; commodity interconnects (NVSwitch, OAM mesh, Huawei HCCS) suffice. On heterogeneous or partially connected systems, hybrid schemes (TokenRing intra-node combined with All-to-All or a higher-level Ring Attention inter-node) preserve algorithmic efficiency (Wang et al., 2024).

6. Context, Generalization, and Practical Implications

Ring Attention and TokenRing extend blockwise Transformers into the regime of "near-infinite context," enabling context lengths that scale as $O(Nc)$, with exact self-attention semantics, numerically stable accumulation, and minimal per-device activation memory.

Unlike prior memory-efficient Transformers (linear, local, or approximate attention), this approach retains full attention expressiveness, allows training and inference with millions of tokens, and is broadly compatible with distributed training and inference frameworks (e.g., xDIT, FSDP pipelines). The only prerequisite is sufficient peer-to-peer interconnect bandwidth to keep communication overlapped with compute at the target block size.

TokenRing's bidirectional, fine-grained ring communication pattern generalizes efficiently to clusters with arbitrary GPU counts and interconnects, sustaining high utilization across a diverse set of cloud and on-premise topologies. This design notably achieves substantial reductions in per-step network latency, high throughput, and enables training and inference at contexts unattainable by previous approaches (Wang et al., 2024, Liu et al., 2023).
