SonicMoE: Hardware-Aware MoE Optimization
- SonicMoE is a hardware-aware optimization suite for Mixture of Experts language models that addresses memory, I/O, and compute bottlenecks on GPU accelerators.
- It combines a memory-efficient activation recomputation algorithm with GPU kernels that overlap I/O and computation, significantly reducing activation memory and improving throughput.
- The suite introduces a novel tile-aware token rounding router that minimizes compute waste in grouped GEMM operations, enhancing overall efficiency.
SonicMoE is a hardware-aware optimization suite for Mixture of Experts (MoE) LLMs designed to address memory, I/O, and compute bottlenecks in fine-grained and highly sparse MoE layers on GPU accelerators. Combining a memory-efficient activation recomputation algorithm, GPU kernels that overlap high-bandwidth memory (HBM) I/O with tensor-core computation, and a novel tile-aware "token rounding" routing scheme, SonicMoE achieves large reductions in activation memory, hardware utilization improvements, and substantial end-to-end throughput gains—all while maintaining or slightly improving downstream model quality. The approach is implemented with open source kernels and demonstrated on Hopper GPUs with state-of-the-art LLMs (Guo et al., 16 Dec 2025).
1. Motivating Challenges in Modern MoE Architectures
Recent MoE models increasingly feature high expert granularity—smaller intermediate expert dimension n at fixed hidden size d—and higher sparsity, increasing the total number of experts E while keeping the per-token activated expert count K approximately constant. These design choices favor model quality per floating point operation (FLOP) but lead to several hardware and memory inefficiencies:
- Activation Memory Scalability: The activation memory footprint in a fine-grained MoE grows linearly with expert granularity; as n decreases and K grows correspondingly, retaining all necessary per-expert intermediates for the backward pass becomes increasingly costly.
- Memory-Bound Arithmetic Intensity: For SwiGLU-based MoE layers, the arithmetic intensity (AI, FLOPs per byte) drops sharply with increasing granularity or with a reduced expert activation fraction K/E, moving the computation into a memory bandwidth–bound regime.
- Tile Quantization Overhead in Grouped GEMM: Under high sparsity, the number of tokens per expert is often well below the GPU's hardware-optimized matrix tile size, causing substantial wasted compute due to padding in grouped GEMM kernel dispatch.
These inefficiencies are not addressed by algorithmic or model-level innovations alone but demand architectural and kernel-level reengineering.
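The arithmetic-intensity trend can be sketched with a back-of-the-envelope model. This is an illustrative calculation, not the paper's: it assumes BF16 operands, a perfectly balanced router, and made-up dimensions.

```python
# Back-of-the-envelope arithmetic intensity (AI) for one expert's up-projection
# GEMM: (t_e x d) @ (d x n). Assumes BF16 (2 bytes/element) and counts HBM
# traffic as read-X + read-W + write-Y; all dimensions are illustrative.
def gemm_arith_intensity(t_e, d, n, bytes_per_elem=2):
    flops = 2 * t_e * d * n                                  # multiply-accumulates
    traffic = bytes_per_elem * (t_e * d + d * n + t_e * n)   # X_e, W1_e, Y_e
    return flops / traffic

T, d, n, K = 4096, 2048, 512, 8
for E in (16, 64, 256):       # more total experts at fixed K => lower fraction K/E
    t_e = T * K // E          # expected tokens per expert under a balanced router
    print(f"E={E:4d}  t_e={t_e:5d}  AI={gemm_arith_intensity(t_e, d, n):6.1f}")
```

As E grows from 16 to 256 the per-expert batch t_e shrinks and AI falls by roughly 3.5×, pushing the GEMM toward the memory-bandwidth-bound regime described above.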
2. Memory-Efficient Forward and Backward Algorithm
SonicMoE reduces per-layer activation memory consumption to a constant in the expert intermediate size n by recomputing the necessary quantities during the backward pass, without incurring additional large GEMM operations.
Standard vs SonicMoE Activation Caching:
- Naïve (ScatterMoE-style) caching stores the gathered inputs and intermediates X_e, H_e, and A_e for every expert, so activation memory for T tokens scales with the activated expert count K and the intermediate size n.
- SonicMoE caches only X, the pre-activations H (2TKn elements), and the routing metadata π and S, plus their gradients, yielding roughly a 45% activation-memory reduction for a 7B-scale MoE.
Forward/Backward Pass Pseudocode:
```
# Forward
for expert e:
    X_e ← gather(X, π[:, e])     # pack this expert's tokens
    H_e ← X_e @ W1_e             # fused gate/up projection (pre-activation)
    A_e ← SwiGLU(H_e)
    Y_e ← A_e @ W2_e             # down projection
O = sum over e of π[:, e] * S[:, e] * Y_e
cache X, H, π, S                 # A and Y are not retained
return O

# Backward
for expert e:
    dO_e  ← gather(dO, π[:, e])
    dA'_e ← dO_e @ W2_e.T
    A_e   ← SwiGLU_forward(H_e)  # recomputed, not read from cache
    dH_e  ← dSwiGLU(dA'_e, H_e)
    dS[:, e] = dot(dA'_e, A_e)   # routing-score gradient
    A'_e  = broadcast(S[:, e], A_e)
```
This approach trades minimal recomputation for a large memory reduction, and ensures that backward-pass activation memory is constant in the expert intermediate size n.
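The caching strategy can be sketched for a single expert in NumPy. This is a hedged reconstruction, not the paper's kernel: the SwiGLU split and silu derivative are standard, variable names are illustrative, and routing is omitted.

```python
import numpy as np

# Single-expert sketch: forward caches only X and the pre-activation H;
# the post-activation A = SwiGLU(H) is recomputed in backward, not stored.
def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(H):
    G, U = np.split(H, 2, axis=-1)   # gate and up halves of the projection
    return silu(G) * U

def forward(X, W1, W2):
    H = X @ W1                       # (T, 2n) pre-activation: the only big cache entry
    Y = swiglu(H) @ W2               # A = swiglu(H) is discarded after the forward
    return Y, (X, H)

def backward(dY, cache, W1, W2):
    X, H = cache
    G, U = np.split(H, 2, axis=-1)
    A = silu(G) * U                  # cheap recomputation replaces a cached tensor
    dA = dY @ W2.T
    s = 1.0 / (1.0 + np.exp(-G))
    dG = dA * U * (s + G * s * (1 - s))   # d/dG silu(G) = sigmoid + G*sig*(1-sig)
    dU = dA * silu(G)
    dH = np.concatenate([dG, dU], axis=-1)
    dW2 = A.T @ dY
    dW1 = X.T @ dH
    dX = dH @ W1.T
    return dX, dW1, dW2
```

The cache holds T·d + 2·T·n values per expert batch regardless of how many intermediates the math nominally produces, which is the constant-in-n property claimed above.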
3. IO–Computation Overlap and GPU Kernel Design
SonicMoE's GPU kernels are engineered to maximize utilization by overlapping memory I/O with compute using several concurrency and kernel fusion techniques:
- Gather Fusion in Grouped GEMM Prologue: The initial indexing and gathering stage is fused into the cp.async HBM load, so token data is packed into GPU shared memory (SMEM) in the correct order for matrix multiplication, efficiently staged by four producer warpgroups.
- Epilogue Fusion for SwiGLU and dSwiGLU: Both the forward SwiGLU activation and its backward pass are fused into GEMM epilogues, obviating separate round-trips between HBM and SMEM for the non-matrix elementwise operations.
- Ping-Pong Scheduling: On Hopper architecture, each GEMM kernel is managed by two consumer warpgroups in an alternating ping-pong cycle, with one executing the MMA main loop while the other handles epilogue I/O. This design realizes 20–40% TFLOPS recovery on memory-intensive workloads.
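A toy cost model gives intuition for the recovery figure. The full-overlap assumption below is an idealization for illustration, not the paper's measurement methodology.

```python
# Toy cost model: without overlap, a kernel step serializes the MMA main loop
# and the epilogue I/O; with ping-pong scheduling, two warpgroups alternate so
# one computes while the other does I/O, ideally hiding the shorter phase.
def speedup(t_mma, t_io):
    serial = t_mma + t_io           # one warpgroup: compute, then epilogue I/O
    overlapped = max(t_mma, t_io)   # two warpgroups in an alternating cycle
    return serial / overlapped

# If epilogue I/O costs 30% of the MMA time, full overlap recovers ~30% throughput,
# consistent in spirit with the 20-40% range quoted above.
print(speedup(1.0, 0.3))
```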
Kernel Throughput (7B MoE, TFLOPS):
| Kernel | ScatterMoE | MoMoE | DeepGEMM++ | SonicMoE |
|---|---|---|---|---|
| Forward up | 380 | 365 | 420 | 450 |
| Forward down | 290 | 275 | 340 | 485 |
| Backward up | 185 | 150 | 220 | 328 |
| Backward down | 270 | 240 | 300 | 355 |
SonicMoE achieves up to +17% forward and +50% backward throughput gains over state-of-the-art kernels.
4. Tile-Aware Token Rounding Router
To avoid waste in grouped GEMM tile quantization, SonicMoE introduces a two-stage "token rounding" router that guarantees each expert receives a token count divisible by the hardware tile size, minimizing padding overhead.
Token Rounding Algorithm Outline:
- For each token, select the top-K expert scores.
- Compute each expert's token count f_e and round it to the nearest (or load-balanced) multiple of the tile size.
- For each expert, retain the round[f_e] highest-scoring tokens.
- Construct a new sparse score matrix enforcing tile-aligned counts.
Pseudocode:
```
Input: S ∈ ℝ^{T×E}, K, tile
1. [S_topK, I_topK] = TopK(S, K)                 # per-token top-K experts
2. f_e = count of tokens assigned to expert e
3. round[f_e] = floor(f_e / tile) * tile  or  ceil(f_e / tile) * tile
4. For each e, sort S[:, e] descending → π_e
5. Retain the first round[f_e] tokens in π_e
Output: S', π' for GEMM dispatch
```
This routing strategy incurs at most one tile's worth of rounding error per expert and experimentally yields up to a 1.16× forward kernel speedup at maximal sparsity, with gains of 1.20×/1.08× on Qwen3-80B-class models.
5. Experimental Evaluation
SonicMoE's performance was evaluated on models ranging from 1.4B to 120B parameters with SwiGLU MoE layers, using Hopper H100 GPUs and the lm-engine codebase.
Per-Layer Activation Memory (GiB):
| Model | ScatterMoE | MoMoE | SonicMoE |
|---|---|---|---|
| 1.4B | 1.9 | 2.2 | 1.2 |
| 7B | 2.6 | 3.0 | 1.4 |
| 30B | 4.2 | 4.8 | 2.5 |
| 120B | 6.0 | 7.3 | 3.1 |
End-to-End Training Throughput (7B MoE):
- ScatterMoE (96 H100): 225B tokens/day
- SonicMoE (64 H100): 213B tokens/day (1.86× compute throughput per GPU relative to ScatterMoE's BF16)
Downstream Quality Metrics:
| Method | Val PPL | Avg Acc |
|---|---|---|
| TC top-K | 15.94 | 49.6% |
| TR nearest | 15.92 | 50.0% |
| EC (ft→TC) | 16.98 | 49.3% |
Token rounding matches or exceeds top-K routing in perplexity and accuracy, with no degradation in model performance.
6. Architectural Generality and Broader Impacts
SonicMoE’s backward memory-saving and I/O-overlap kernels are applicable to any tiled sparse-compute architecture, such as block-sparse transformers and future GPU architectures leveraging large SMEM or TMEM (e.g., Blackwell). The token-rounding router is a direct substitute for top-K routing in MoE training and can potentially support hash-based or locality-sensitive expert selection at inference. No adverse effects on statistical efficiency or model quality are observed.
Planned extensions include expert parallelism across nodes with overlapping all-to-all and GEMM computation, FP8 support, and further cache optimizations for deep MoE architectures. A plausible implication is that as model sparsity increases or hardware tiling constraints become stricter, tile-aware routing and memory-efficient backward algorithms will be increasingly essential for scalable MoE deployment (Guo et al., 16 Dec 2025).