ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

Published 26 Apr 2026 in cs.DC | (2604.23553v1)

Abstract: LLM decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces full Transformer-block operator fusion using DSMEM, significantly reducing decoding latency in LLMs.
It details kernel optimizations, including single-pass LayerNorm and tree reduction, to minimize memory traffic and kernel launch overhead.
Experimental results on NVIDIA RTX 5090 show up to 1.34× TPOT speedup with negligible loss in output fidelity.

ClusterFusion++: Enabling Full Transformer-Block Operator Fusion for Efficient LLM Decoding

Introduction

LLM deployment is heavily constrained by autoregressive decoding latency, which increases with larger model sizes and longer context lengths. Modern decoding workflows execute the multiple operators within a Transformer block as fragmented GPU kernels, necessitating intermediate tensor storage in off-chip (global) memory. This fragmented operator scheduling introduces significant global memory traffic and kernel launch overhead, ultimately constraining throughput and increasing per-token latency—factors that remain critical even with state-of-the-art hardware.

Recent architectural advances in NVIDIA GPUs, particularly thread-block clusters with distributed shared memory (DSMEM), have opened new avenues for on-chip inter-block collectives, as exemplified by the original ClusterFusion framework, which fused the attention-side operators for LLM inference. The work, "ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding" (2604.23553), systematically extends this paradigm to perform operator fusion across the entire Transformer decoder block in GPT-NeoX/Pythia models, unifying LayerNorm, QKV projection, RoPE, multi-head attention, projections, MLP, and residual pathways. This extension is backed with architectural adaptation, CUDA Graph compatibility, and detailed kernel-level optimizations—substantially reducing overall decoding step latency with only negligible effects on output fidelity.

Methodology and System Design

Full-Block Fusion

ClusterFusion++ broadens the scope of prior cluster-centric fusion beyond just the attention pipeline. A single CUDA kernel, orchestrated over thread-block clusters, now processes every operation of the Transformer decoder block sequentially and in-register/on-chip, encompassing:

Pre-attention LayerNorm
QKV projection, along with memory-efficient KV cache management
Rotary Positional Embedding (RoPE), including model-specific partial rotation (as in Pythia-2.8B)
Scaled dot-product attention, fused with on-chip key-value accumulation
Output projection and residual update
Post-attention LayerNorm
Full MLP, including up-projection, activation (via PTX-accelerated GELU), down-projection, and residual

This comprehensive fusion exploits DSMEM-based collective primitives, circumventing the block-barrier limitation of prior on-chip fusion, and eliminates the need for intermediate tensor materialization in global memory between any stage of the block.

Architecture-Specific Adaptations

The kernel is shaped by GPT-NeoX/Pythia architectural details: weight matrices stored per-head-interleaved, the partial RoPE application unique to Pythia-2.8B (only the first 25% of each head dimension), and addition of bias at multiple layers. Special tiling and warp selection are employed to accommodate non-power-of-two dimensions such as $d_\text{head}=80$ .

Kernel and Execution Optimizations

Key kernel-level innovations include:

Single-pass LayerNorm: Combines mean and variance computation using $\mathrm{Var}(x)=\mathbb{E}[x^2]-\mathbb{E}[x]^2$ , halving memory traffic compared to two-pass implementations.
Cluster-Cooperative Attention: Blocks in a cluster partition the sequence length, exchanging partial results via DSMEM for efficient on-chip reduction.
Tree Reduction: Replaces sequential $O(n)$ ring reductions with tree-structured $O(\log n)$ reductions for output accumulation across blocks.
PTX-Intrinsic GELU: Leverages hand-optimized inline PTX for efficient nonlinearity computation.

To address kernel-launch and buffer-allocation overheads, ClusterFusion++ implements a persistent CUDA Graph mode. All layer-level context descriptors and buffer allocations are constructed once and reused over decode steps, allowing each autoregressive decode iteration to be launched as a graph replay, which further slashes per-step overhead.

Experimental Results

Evaluation on an NVIDIA RTX 5090 demonstrates substantial improvements in both throughput and time-per-output-token (TPOT), across both Pythia-2.8B and Pythia-6.9B models.

ClusterFusion++ delivers consistent speedups over the HuggingFace Transformers baseline with KV caching enabled:

Pythia-2.8B: TPOT speedups from 1.21× up to 1.34× across sequence lengths of 16 to 2048.
Pythia-6.9B: Similar speedups from 1.19× to 1.34×.

These speedups are accompanied by equivalently improved throughput, as shown below.

Figure 1: Time per output token (TPOT) for Pythia-2.8B (left) and Pythia-6.9B (right) on the RTX 5090 GPU, demonstrating up to 1.34× speedup with ClusterFusion++ compared to the baseline.

Output fidelity is preserved at a near-token-identical level (≈99.4–99.8% token match rate on WikiText-2), with only negligible non-deterministic differences, primarily attributed to the use of FP16 atomics during intra-cluster reductions. Perplexity evaluations remain unchanged, as the modifications exclusively affect the decode phase.

Ablation studies reveal important synergy: fusing an MLP kernel alone is suboptimal (0.75× PyTorch baseline), yet, when fused with attention and MLP-up, the throughput exceeds the attention-only path due to register reuse and amortized memory traffic/self-synchronization overhead. Hence, cross-operator kernel fusion yields amplified performance benefits compared to the sum of its isolated components.

Discussion and Implications

ClusterFusion++ establishes the technical feasibility and practicality of block-wide CUDA kernel fusion using DSMEM collectives. The demonstrated speedups, especially on top-of-the-line hardware, have immediate relevance in reducing inference latency for deployed LLM-serving systems, which translates to higher token throughput, lower time-to-first-token, and potentially reduced costs at scale.

The persisting near-deterministic output and unchanged perplexity metrics indicate that these system-level optimizations can be introduced into production settings without model retraining or fine-tuning. Furthermore, the architectural adaptability strategies outlined (e.g., non-power-of-two $d_\text{head}$ , custom RoPE) are likely to generalize to other open-source and proprietary GPT variants.

Practically, the graph-based execution model developed here is orthogonal and complementary to operator fusion, suggesting that future work might further optimize autoregressive decoding by combining graph-level scheduling with advanced memory and operator fusion primitives. Additionally, the reduction of global memory bandwidth use and kernel dispatches can mitigate scaling bottlenecks seen when serving LLMs in high-concurrency contexts (e.g., via multi-process or multi-stream inference backends).

On a theoretical level, these system-level advances prompt reconsideration of optimal operator granularity for on-device neural network execution in scenarios with rapidly converging hardware/software interfaces.

Looking ahead, future research and engineering efforts may target:

Extending full-block fusion to Transformer encoder-decoder architectures and vision Transformers.
Hardware-aware auto-tuning for further performance gains on variable architectures and context lengths.
Integration with quantization, sparsity, and low-rank adaptation techniques.
Moving from PyTorch/HuggingFace baselines to further optimized, production-grade LLM inference frameworks with native cluster-centric kernels.

Conclusion

ClusterFusion++ delivers a comprehensive system-level contribution in efficient LLM decoding, demonstrating that full Transformer-block CUDA fusion—enabled by recent hardware features—yields robust throughput gains, reduced per-token latency, and maintains output fidelity for large-scale LLMs on modern NVIDIA GPUs. This approach is poised for further extension and integration into high-performance, low-latency inference deployments in real-world AI applications.

Markdown Report Issue