
SonicMoE: Hardware-Aware MoE Optimization

Updated 19 December 2025
  • SonicMoE is a hardware-aware optimization suite for Mixture of Experts language models that addresses memory, I/O, and compute bottlenecks on GPU accelerators.
  • It employs a memory-efficient activation recomputation algorithm and overlapping GPU kernels to significantly reduce activation memory and improve throughput.
  • The suite introduces a novel tile-aware token rounding router that minimizes compute waste in grouped GEMM operations, enhancing overall efficiency.

SonicMoE is a hardware-aware optimization suite for Mixture of Experts (MoE) LLMs designed to address memory, I/O, and compute bottlenecks in fine-grained and highly sparse MoE layers on GPU accelerators. Combining a memory-efficient activation recomputation algorithm, GPU kernels that overlap high-bandwidth memory (HBM) I/O with tensor-core computation, and a novel tile-aware "token rounding" routing scheme, SonicMoE achieves large reductions in activation memory, improved hardware utilization, and substantial end-to-end throughput gains, all while maintaining or slightly improving downstream model quality. The approach is implemented with open-source kernels and demonstrated on Hopper GPUs with state-of-the-art LLMs (Guo et al., 16 Dec 2025).

1. Motivating Challenges in Modern MoE Architectures

Recent MoE models increasingly feature high expert granularity, i.e., smaller intermediate expert dimension ($n$) at fixed hidden size ($d$), and higher sparsity, increasing the total number of experts ($E$) while keeping the per-token activated expert count ($K$) approximately constant. These design choices favor model quality per floating point operation (FLOP) but lead to several hardware and memory inefficiencies:

  • Activation Memory Scalability: The activation memory footprint of a fine-grained MoE grows linearly with expert granularity ($G = d/n$); as $n$ decreases, retaining all necessary intermediates for the backward pass becomes increasingly costly.
  • Memory-Bound Arithmetic Intensity: For SwiGLU-based MoE layers, the arithmetic intensity (AI, FLOPs/byte) drops sharply with increasing granularity or with a reduced expert activation fraction ($\rho = K/E$), pushing the computation into a memory bandwidth-bound regime.
  • Tile Quantization Overhead in Grouped GEMM: Under high sparsity, the per-expert token count $T_e$ is often well below the GPU's hardware-optimized matrix tile size ($M_{\rm tile}$), causing substantial wasted compute due to padding in grouped GEMM kernel dispatch.

These inefficiencies are not addressed by algorithmic or model-level innovations alone but demand architectural and kernel-level reengineering.
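The arithmetic-intensity argument can be made concrete with a small roofline-style model. The cost model below (even token spread across experts, BF16 storage, and the specific byte accounting) is an illustrative simplification, not taken from the paper:

```python
def arithmetic_intensity(T, d, n, K, E, bytes_per_el=2):
    """Rough FLOPs-per-byte model for the grouped up-projection GEMM of a
    SwiGLU MoE layer, assuming tokens spread evenly so each expert sees
    T_e = T*K/E tokens (BF16: 2 bytes per element)."""
    Te = T * K / E
    flops = 2 * Te * d * (2 * n)                       # (Te x d) @ (d x 2n)
    moved = (Te * d + d * 2 * n + Te * 2 * n) * bytes_per_el
    return flops / moved

# Halving the activation fraction rho = K/E (by doubling E) lowers the
# intensity, pushing the kernel toward the memory-bound regime.
dense_ai = arithmetic_intensity(T=8192, d=1536, n=256, K=8, E=64)
sparse_ai = arithmetic_intensity(T=8192, d=1536, n=256, K=8, E=128)
```

Under this model, shrinking $n$ (higher granularity $G = d/n$) or shrinking $\rho = K/E$ both reduce FLOPs/byte, matching the qualitative claims above.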

2. Memory-Efficient Forward and Backward Algorithm

SonicMoE reduces per-layer activation memory consumption to a constant in the expert intermediate size $n$ by recomputing necessary quantities during the backward pass without incurring additional large GEMM operations.

Standard vs SonicMoE Activation Caching:

  • Naïve (ScatterMoE-style) caching stores $\{Y, X_e, H_e, A_e\}$ per expert, leading to $O(TK(d+n))$ bytes for $T$ tokens and $K$ experts activated per token.
  • SonicMoE caches only $X$ ($Td$ elements), $H$ ($2TKn$ elements), and routing metadata, plus their gradients, resulting in a memory requirement of $M_{\rm sonic} = 2Td + 4TKn$ bytes and yielding a 45% reduction for a 7B MoE with $(d, n, K) = (1536, 256, 8)$.
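The stated formula $M_{\rm sonic} = 2Td + 4TKn$ can be checked with a toy calculator; the assumption that the byte counts correspond to BF16 (2 bytes/element, with $H$ holding both gate and up halves) is mine:

```python
def sonic_activation_bytes(T, d, n, K, bytes_per_el=2):
    """M_sonic = 2Td + 4TKn bytes under BF16: X is T*d elements and H is
    2*T*K*n elements (gate and up halves), each cached once."""
    return bytes_per_el * (T * d + 2 * T * K * n)

# Per-token footprint for the 7B configuration (d, n, K) = (1536, 256, 8)
m = sonic_activation_bytes(T=1, d=1536, n=256, K=8)   # 2*1536 + 4*8*256
```

Note that the per-token cost $2d + 4Kn$ depends on the total routed width $Kn$, not on $n$ alone, which is what makes the footprint robust to increasing granularity.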

Forward/Backward Pass Pseudocode:

for expert e:                        (forward)
    X_e ← gather(X, π[:, e])
    H_e ← X_e @ W1_e
    A_e ← SwiGLU(H_e)
    Y_e ← A_e @ W2_e
O = Σ_e π[:, e] · S[:, e] · Y_e
cache X, H, π, S
return O

for expert e:                        (backward)
    dO_e ← gather(dO, π[:, e])
    dA'_e ← dO_e @ W2_e.T
    A_e ← SwiGLU(H_e)                (recomputed from cached H, not stored)
    dH_e ← dSwiGLU(dA'_e, H_e)
    dS[:, e] ← dot(dA'_e, A_e)       (row-wise)
    A'_e ← S[:, e] ⊙ A_e             (broadcast of gate scores)

This approach trades a small amount of recomputation for a substantial memory reduction, and ensures that backward-pass activation memory is constant in $n$.
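The SwiGLU and dSwiGLU operators invoked in the pseudocode can be sketched in NumPy. The layout convention (gate and up halves concatenated along the last axis of $H$) is an assumption for illustration:

```python
import numpy as np

def swiglu(h):
    """SwiGLU on h = [gate | up] along the last axis: SiLU(g) * u."""
    g, u = np.split(h, 2, axis=-1)
    return g / (1.0 + np.exp(-g)) * u          # SiLU(g) = g * sigmoid(g)

def dswiglu(dA, h):
    """Gradient of swiglu w.r.t. h given upstream gradient dA, recomputed
    from h alone -- no cached activations are required."""
    g, u = np.split(h, 2, axis=-1)
    sig = 1.0 / (1.0 + np.exp(-g))
    silu = g * sig
    dg = dA * u * (sig + silu * (1.0 - sig))   # d(SiLU)/dg = sig*(1 + g*(1-sig))
    du = dA * silu
    return np.concatenate([dg, du], axis=-1)
```

Because `dswiglu` needs only `h`, caching $H$ suffices to rebuild both $A_e$ and the activation gradient during backward, which is the core of the memory saving.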

3. IO–Computation Overlap and GPU Kernel Design

SonicMoE's GPU kernels are engineered to maximize utilization by overlapping memory I/O with compute using several concurrency and kernel fusion techniques:

  • Gather Fusion in Grouped GEMM Prologue: The initial indexing and gathering stage is fused into the cp.async HBM load such that token data is packed in GPU shared memory (SMEM) in the correct order for matrix multiplication, efficiently staged by four producer warpgroups.
  • Epilogue Fusion for SwiGLU and dSwiGLU: Both forward SwiGLU activation and its backward pass are fused into GEMM epilogues, obviating separate round-trips between HBM and SMEM for non-matrix activations.
  • Ping-Pong Scheduling: On Hopper architecture, each GEMM kernel is managed by two consumer warpgroups in an alternating ping-pong cycle, with one executing the MMA main loop while the other handles epilogue I/O. This design realizes 20–40% TFLOPS recovery on memory-intensive workloads.
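A back-of-the-envelope view of what gather fusion avoids: an unfused gather materializes the permuted token matrix in HBM (one extra write plus one extra read of $TKd$ elements) before the grouped GEMM, while the fused prologue stages tokens directly into SMEM. This traffic model is illustrative, not from the paper:

```python
def extra_hbm_traffic_unfused_gather(T, K, d, bytes_per_el=2):
    """Extra HBM bytes incurred by a standalone gather kernel: the permuted
    (T*K, d) activation matrix is written out once and read back once."""
    return 2 * T * K * d * bytes_per_el

# Avoided round-trip per layer for a 7B-style configuration (BF16)
saved = extra_hbm_traffic_unfused_gather(T=8192, K=8, d=1536)
```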

Throughput Metrics (7B MoE, $d=1536$, $n=256$, $K=8$, $E=128$):

| Kernel | ScatterMoE | MoMoE | DeepGEMM++ | SonicMoE |
| --- | --- | --- | --- | --- |
| Forward up | 380 | 365 | 420 | 450 |
| Forward down | 290 | 275 | 340 | 485 |
| Backward $dH$ | 185 | 150 | 220 | 328 |
| Backward $dX$ | 270 | 240 | 300 | 355 |

SonicMoE achieves up to +17% forward and +50% backward throughput gains over state-of-the-art kernels.

4. Tile-Aware Token Rounding Router

To avoid waste in grouped GEMM tile quantization, SonicMoE introduces a two-stage "token rounding" router that guarantees each expert receives a token count divisible by the hardware tile size, minimizing padding overhead.

Token Rounding Algorithm Outline:

  1. For each token, select the top-$K$ expert scores.
  2. Compute expert token counts $f_e$ and round to the nearest or balanced multiple of $M_{\rm tile}$.
  3. For each expert, retain the first $\text{round}[f_e]$ highest-scoring tokens.
  4. Construct a new sparse score matrix enforcing tile-aligned counts.

Pseudocode:

Input: S ∈ ℝ^{T×E}, K, tile
1. [S_topK, I_topK] = TopK(S, K)
2. f_e = count of tokens assigned to expert e
3. round[f_e] = floor(f_e/tile)·tile or ceil(f_e/tile)·tile
4. For each e, sort S[:, e] → π_e
5. Retain the first round[f_e] tokens in π_e
Output: S', π' for GEMM dispatch

This routing strategy incurs at most $M_{\rm tile}$ rounding error per expert and experimentally yields up to 1.16× forward kernel speedup at maximal sparsity ($K/E \approx 1/128$), with gains of 1.20×/1.08× in Qwen3-80B-class models.
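The two-stage procedure can be sketched in NumPy. This sketch uses the floor-rounding variant only (dropping overflow tokens rather than balancing up), and all function names are illustrative:

```python
import numpy as np

def token_rounding_route(S, K, tile):
    """Tile-aware token rounding (illustrative): after top-K selection,
    each expert keeps only its highest-scoring tokens, with the count
    floored to a multiple of the GEMM tile size."""
    T, E = S.shape
    # Stage 1: standard top-K expert assignment per token
    topk = np.argpartition(-S, K - 1, axis=1)[:, :K]
    mask = np.zeros((T, E), dtype=bool)
    mask[np.arange(T)[:, None], topk] = True
    # Stage 2: per expert, round the token count down to a tile multiple
    # by dropping the lowest-scoring assigned tokens
    for e in range(E):
        tok = np.nonzero(mask[:, e])[0]
        keep = (len(tok) // tile) * tile
        drop = tok[np.argsort(S[tok, e])[: len(tok) - keep]]
        mask[drop, e] = False
    return mask

rng = np.random.default_rng(0)
S = rng.standard_normal((256, 16))
mask = token_rounding_route(S, K=4, tile=16)
```

Every expert's token count is now tile-aligned, so the grouped GEMM dispatch incurs no padding waste; each token is routed to at most $K$ experts, possibly fewer after rounding.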

5. Experimental Evaluation

SonicMoE's performance was evaluated on models ranging from 1.4B to 120B parameters with SwiGLU MoE layers, using Hopper H100 GPUs and the lm-engine codebase.

Per-Layer Activation Memory (GiB):

| Model | ScatterMoE | MoMoE | SonicMoE |
| --- | --- | --- | --- |
| 1.4B | 1.9 | 2.2 | 1.2 |
| 7B | 2.6 | 3.0 | 1.4 |
| 30B | 4.2 | 4.8 | 2.5 |
| 120B | 6.0 | 7.3 | 3.1 |

End-to-End Training Throughput (7B MoE):

  • ScatterMoE (96 × H100): 225B tokens/day
  • SonicMoE (64 × H100): 213B tokens/day (1.86× compute throughput per GPU relative to ScatterMoE's BF16)
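In raw token terms the two runs compare as follows; note that the 1.86× figure quoted above measures compute throughput per GPU, which is a different quantity from this simple tokens-per-GPU-per-day ratio:

```python
# Tokens per GPU per day for each configuration quoted above
scattermoe_per_gpu = 225e9 / 96    # 96 x H100
sonicmoe_per_gpu = 213e9 / 64      # 64 x H100
ratio = sonicmoe_per_gpu / scattermoe_per_gpu   # 1.42x tokens/GPU/day
```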

Downstream Quality Metrics:

| Method | Val PPL | Avg Acc |
| --- | --- | --- |
| TC top-K | 15.94 | 49.6% |
| TR nearest | 15.92 | 50.0% |
| EC (ft→TC) | 16.98 | 49.3% |

Token rounding matches or exceeds top-$K$ routing in perplexity and accuracy, with no degradation in model performance for $K/E \leq 1/32$.

6. Architectural Generality and Broader Impacts

SonicMoE’s backward memory-saving and I/O-overlap kernels are applicable to any tiled sparse-compute architecture, such as block-sparse transformers and future GPU architectures leveraging large SMEM or TMEM (e.g., Blackwell). The token-rounding router is a direct substitute for top-K routing in MoE training and can potentially support hash-based or locality-sensitive expert selection at inference. No adverse effects on statistical efficiency or model quality are observed.

Planned extensions include expert parallelism across nodes with overlapping all-to-all and GEMM computation, FP8 support, and further cache optimizations for deep MoE architectures. A plausible implication is that as model sparsity increases or hardware tiling constraints become stricter, tile-aware routing and memory-efficient backward algorithms will be increasingly essential for scalable MoE deployment (Guo et al., 16 Dec 2025).
