Ring Attention: Blockwise Transformers
- The paper introduces a ring-based attention mechanism that partitions sequences across devices, enabling exact self-attention without approximations.
- It uses blockwise computation with a bidirectional TokenRing communication pattern to overlap compute and data transfers, thereby reducing latency.
- Empirical results show that this method scales context length by orders of magnitude while maintaining high hardware utilization and outperforming traditional attention models.
Ring Attention with Blockwise Transformers enables exact Transformer self-attention and feedforward on sequences that are orders of magnitude longer than the memory limits of conventional architectures, without resorting to approximations or sacrificing compute/communication efficiency. The core principle is partitioning the sequence dimension across devices so that each processes local query blocks and rotates key-value blocks (or queries) in a ring topology. This achieves scalable context length, minimal activation memory, and high hardware utilization. Recent innovations, such as TokenRing, further address the communication bottlenecks inherent in earlier ring attention approaches by employing bidirectional, fully overlapping peer-to-peer data transfers, realizing substantial improvements in throughput and latency (Wang et al., 2024, Liu et al., 2023).
1. Blockwise Attention Formulation and Partitioning
Let a Transformer layer receive inputs $Q, K, V \in \mathbb{R}^{H \times S \times D}$, where $H$ is the number of attention heads, $S$ is the sequence length, and $D$ is the per-head hidden dimension. The sequence is partitioned into $N$ contiguous blocks of length $B = S/N$, each assigned to a rank (device) $i \in \{0, \dots, N-1\}$:
- $Q = [Q_0, Q_1, \dots, Q_{N-1}]$, with $Q_i \in \mathbb{R}^{H \times B \times D}$.
- Similarly for $K$ and $V$.
Each rank retains $Q_i$, $K_i$, and $V_i$ locally. Iteratively, $(K, V)$ blocks circulate through the logical ring so that, at each step $t$, rank $i$ receives $(K_j, V_j)$ with $j = (i - t) \bmod N$.
Blockwise attention is computed as $O_{ij} = \mathrm{softmax}\!\left(Q_i K_j^\top / \sqrt{D}\right) V_j$, yielding for each block pair a partial output $O_{ij}$ and a per-row log-sum-exp statistic $\mathrm{lse}_{ij}$.
Partial results are merged in a numerically stable log-sum-exp manner into global accumulators $O$ and $\mathrm{lse}$, using the following update for each head and token:

$$\mathrm{lse}' = \log\!\left(e^{\mathrm{lse}} + e^{\mathrm{lse}_{ij}}\right), \qquad O' = e^{\mathrm{lse} - \mathrm{lse}'}\, O + e^{\mathrm{lse}_{ij} - \mathrm{lse}'}\, O_{ij}$$
After $N$ steps, each GPU owns the final outputs $O_i$ for a unique sequence segment (Wang et al., 2024).
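The merge rule above can be checked numerically. The following sketch (NumPy, with illustrative shapes) splits the keys into two blocks, computes blockwise softmax attention with per-row log-sum-exp statistics, and verifies that the merged result equals attention over all keys at once:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8                        # tokens per query block, head dim (illustrative)
q = rng.normal(size=(B, D))
k = rng.normal(size=(2 * B, D))    # two key blocks' worth of keys
v = rng.normal(size=(2 * B, D))

def block_attn(q, k, v):
    """Blockwise attention: softmax-weighted values plus per-row log-sum-exp."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    lse = np.log(np.exp(s).sum(-1, keepdims=True))   # unguarded exp is fine at demo scale
    o = np.exp(s - lse) @ v
    return o, lse

def lse_merge(o_a, lse_a, o_b, lse_b):
    """Numerically stable merge of two partial attention results."""
    lse = np.logaddexp(lse_a, lse_b)                 # log(e^a + e^b), stable
    return np.exp(lse_a - lse) * o_a + np.exp(lse_b - lse) * o_b, lse

# Blockwise over the two halves, then merge.
o1, l1 = block_attn(q, k[:B], v[:B])
o2, l2 = block_attn(q, k[B:], v[B:])
o_merged, _ = lse_merge(o1, l1, o2, l2)

# Reference: attention over all keys at once.
o_full, _ = block_attn(q, k, v)
assert np.allclose(o_merged, o_full)
```

Because the merge is associative, the same update can absorb one ring partner's block per step, in any arrival order.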
2. Peer-to-Peer (P2P) Ring Communication Patterns
The underlying communication topology is a logical ring where each rank communicates only with its neighbors. In baseline Ring Attention (Liu et al., 2023), key-value ($K$, $V$) blocks are rotated clockwise; at each step, a rank computes attention for its fixed $Q$ block against the received ($K$, $V$) pair and sends its local $(K, V)$ block to the next neighbor.
TokenRing introduces bidirectional communication to achieve near-perfect overlap between compute and communication:
- Forward ring: $Q$ blocks are sent clockwise.
- Backward ring: block outputs and $\mathrm{lse}$ statistics are sent counter-clockwise.
This bidirectional strategy exploits full-duplex links typical in OAM full-mesh, NVLink mesh, NVSwitch, or Huawei Ascend HCCS, enabling both directions to transmit simultaneously, thereby halving the effective per-step communication latency.
Example Pseudocode Structure
A simplified iteration (schematic):
```python
for i in range(N):
    # Forward: send Q to right neighbor, receive Q from left
    # Backward: send previous block_out, block_lse to left, receive from right
    # Compute local blockwise attention
    # Merge results using LogSumExpUpdate
    # Synchronize before next iteration
```
This concurrent messaging ensures no rank sits idle, and computation on the next block can begin as soon as data is available (Wang et al., 2024).
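The rotation schedule itself can be simulated in a few lines. The single-process sketch below (communication primitives elided; the shift direction is an assumption matching "send to right, receive from left") checks that after $N$ steps every query block has been paired with every rank's key-value block exactly once:

```python
# Simulated TokenRing schedule: query blocks rotate clockwise while each
# rank's K/V blocks stay fixed.
N = 4                                    # number of ranks in the ring
pairs = set()                            # (q_block, kv_rank) pairs computed
q_at = list(range(N))                    # q_at[r] = which Q block rank r holds
for step in range(N):
    for rank in range(N):
        pairs.add((q_at[rank], rank))    # compute attn(Q block, local K/V)
    # Send Q to the right neighbor, receive from the left.
    q_at = [q_at[(r - 1) % N] for r in range(N)]

# Every query block met every rank's K/V exactly once.
assert pairs == {(q, kv) for q in range(N) for kv in range(N)}
```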
3. Asymptotic Complexity and Communication Overhead
Let $N$ = number of GPUs, $B$ = per-GPU block length, $D$ = head dimension, $H$ = number of heads.
- Compute per step per GPU: $O(H B^2 D)$ (blockwise Flash Attention)
- Communication per step:
  - TokenRing: send the $Q$ block forward ($H B D$ elements), plus the block output and $\mathrm{lse}$ statistics backward ($H B D + H B$ elements).
  - Baseline Ring Attention: transmit the $K$ and $V$ blocks ($2 H B D$ elements) unidirectionally.
- Per-step latency, for per-direction link bandwidth $W$:
  - TokenRing: $\approx \max(H B D,\; H B D + H B)/W$, since the forward and backward transfers overlap on full-duplex links.
  - Baseline: $\approx 2 H B D / W$, all over a single direction.
Because computation scales as $O(B^2)$ but communication as $O(B)$, increasing $N$ (and thus shrinking $B$ at fixed sequence length) eventually makes the system communication-bound. TokenRing's bidirectional overlap delays this scaling bottleneck by a factor of 2 (Wang et al., 2024).
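A back-of-envelope calculation makes the crossover concrete. The throughput and bandwidth figures below are illustrative assumptions, not measurements from either paper:

```python
# Per-step compute scales with B^2 while per-step communication scales with B,
# so shrinking the block (adding GPUs at fixed sequence length) eventually
# makes each step communication-bound.
H, D = 32, 128                  # heads, head dim (LLaMA2-7B-like, assumed)
flops_rate = 150e12             # sustained bf16 FLOP/s per GPU (assumed)
link_bw = 300e9                 # bytes/s per link direction (assumed)

def step_times(B):
    compute = 4 * H * B * B * D / flops_rate   # QK^T + PV matmuls, ~2 FLOPs/MAC each
    comm = 2 * H * B * D * 2 / link_bw         # K and V blocks, 2-byte elements
    return compute, comm

for B in (256, 1024, 4096):
    c, m = step_times(B)
    bound = "comm-bound" if m > c else "compute-bound"
    print(f"B={B:5d}  compute={c*1e3:7.3f} ms  comm={m*1e3:7.3f} ms  {bound}")
```

Under these assumed rates the crossover sits at $B = \text{flops\_rate}/\text{link\_bw} \cdot \text{(bytes per element)} / \text{(FLOPs per element pair)}$, i.e. a few hundred tokens per block; real systems shift it with kernel efficiency and overlap quality.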
Activation Memory Scaling
| Scheme | Per-device Memory | Bottleneck |
|---|---|---|
| Vanilla | $O(b s^2)$ (materialized attention matrix) | Quadratic in $s$ |
| BPT | $O(b s h)$ | Linear in $s$ |
| Ring Attention | $6 b c h$ per device | Linear in block size $c$, independent of $s$ |

Here $b$ is the batch size, $s$ the full sequence length, $h$ the hidden size, and $c$ the per-device block length.
This enables effective "infinite context" with memory usage independent of the full sequence length (Liu et al., 2023).
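Plugging representative numbers into the $6bch$ figure above illustrates the point; the hidden size and 2-byte dtype below are assumptions for illustration:

```python
# Per-device activation memory for Ring Attention depends only on the block
# size c, not on the full sequence length s.
b = 1                # batch size
h = 4096             # hidden size (LLaMA2-7B-like, assumed)
bytes_per = 2        # bf16 elements (assumed)

def ring_activation_bytes(c):
    """6*b*c*h activation elements per device, as quoted in the table above."""
    return 6 * b * c * h * bytes_per

for c in (4_096, 16_384):
    gib = ring_activation_bytes(c) / 2**30
    print(f"block {c:>6} tokens -> {gib:.2f} GiB/device")
```

The same footprint serves a 256K-token or a 4M-token sequence: extra length is absorbed by adding devices to the ring, not by growing per-device activations.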
4. Empirical Results and Performance Benchmarks
On 4×NVIDIA A10 GPUs running LLaMA2-7B at long context, TokenRing achieves:
| Method | Comm/Step (ms) | Compute/Step (ms) | Speedup |
|---|---|---|---|
| RingAttention (PXB) | 7.6 | 2.5 | 1× |
| TokenRing (PXB) | 3.5–4.6 | — | 1.7× |
| TokenRing (PIX) | 3.2–4.0 | — | 2.0× |
Maximum context lengths, measured on various accelerators and models:
| Accelerator | Model | Vanilla | MemEff Attn | BPT | Ring Attn |
|---|---|---|---|---|---|
| 8×A100 NVLink | 7B | 2K | 16K | 32K | 256K |
| 32×A100 InfiniBand | 13B | 4K | 32K | 64K | 2048K |
| TPUv4-1024 | 65B | 4K | 8K | 16K | 4096K |
Model FLOPs utilization degrades by less than 5 percentage points compared to blockwise partitioning at standard context lengths, and remains high in large-scale LLMs at very long contexts (Liu et al., 2023).
On line retrieval tasks (LLaMA-13B, 512K context) Ring Attention outperforms GPT3.5-16K and Claude-100K by absolute margins up to 50 percentage points at maximal context, indicating that exact attention is retained even at extreme lengths.
5. Load Balancing and Topological Adaptability
Uniform sequence partitioning and zigzag ordering guarantee identical workload per rank and synchronized operation across the ring, avoiding load imbalance. For causal LLMs, the zigzag partition skips query-key block pairs with no causal dependency, reducing computation and communication further.
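A minimal sketch of the zigzag idea (the pairing rule below is one common formulation, offered as an assumption rather than the exact scheme of either paper): split the sequence into $2N$ chunks and give rank $i$ chunks $i$ and $2N-1-i$, so that cheap early chunks pair with expensive late ones under causal masking.

```python
# Zigzag assignment for causal attention across N ranks.
N = 4                                        # ranks (illustrative)
assign = {r: (r, 2 * N - 1 - r) for r in range(N)}

def causal_work(chunk):
    # Keys visible to queries in a chunk grow linearly with its position,
    # so later chunks are proportionally more expensive.
    return chunk + 1

loads = [sum(causal_work(c) for c in assign[r]) for r in range(N)]
assert len(set(loads)) == 1                  # every rank gets identical workload
```

Each rank's load is $(r + 1) + (2N - r) = 2N + 1$, a constant independent of $r$, which is exactly the balance property the text describes.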
TokenRing requires only a full-duplex, non-blocking P2P primitive, so commodity interconnects (NVSwitch, OAM mesh, Huawei HCCS) suffice. On heterogeneous or partially connected systems, hybrid schemes (TokenRing intra-node + All-to-All or higher-level RingAttention inter-node) preserve algorithmic efficiency (Wang et al., 2024).
6. Context, Generalization, and Practical Implications
Ring Attention and TokenRing extend blockwise Transformers into the regime of "near-infinite context," enabling context lengths that scale linearly with the number of participating devices, with exact self-attention semantics, numerically stable accumulation, and minimal per-device activation memory.
Unlike prior memory-efficient Transformers (linear, local, or approximate attention), this approach retains full attention expressiveness, allows training and inference with millions of tokens, and is broadly compatible with distributed training and inference frameworks (e.g., xDIT, FSDP pipelines). The only prerequisite is sufficient collective bandwidth to maintain compute-communication overlap for the target block size.
TokenRing's bidirectional, fine-grained ring communication pattern generalizes efficiently to clusters with arbitrary GPU counts and interconnects, sustaining high utilization across a diverse set of cloud and on-premise topologies. This design notably achieves substantial reductions in per-step network latency, high throughput, and enables training and inference at contexts unattainable by previous approaches (Wang et al., 2024, Liu et al., 2023).