Sparse-Per-Token MoE Architecture
- Sparse-Per-Token MoE is a neural architecture that assigns each token to a small subset of experts using dynamic top-$k$ or top-$p$ routing.
- It employs a token-level gating mechanism that selectively activates only a few feed-forward networks per token, ensuring efficient computation and balanced load.
- This design scales transformer models by decoupling parameter count from per-token FLOPs, enhancing accuracy-efficiency trade-offs while reducing latency.
A Sparse-Per-Token Mixture-of-Experts (MoE) architecture is a neural module in which, for each token in an input sequence, only a small, dynamically selected subset of feed-forward “expert” networks is activated, while the remainder stay idle. This design allows substantial increases in parameter count and representational capacity while keeping per-token computational and memory costs bounded. The sparse-per-token paradigm is realized by a per-token gating (router) network that selects $k$ out of $N$ experts based on each token’s embedding, supporting adaptability, interpretability, and efficiency. Sparse-per-token MoE sublayers are widely adopted in large-scale transformers across language modeling, vision-language, audio, and recommender domains, and underlie recent state-of-the-art models in both industry and open research.
1. Formal Definition and Core Mechanism
At the heart of a sparse-per-token MoE layer is a token-level routing mechanism that assigns each token to a (typically small) subset of $k$ experts from a global pool of $N$. The generic forward computation is

$$\mathbf{y} = \sum_{i \in \mathcal{S}(\mathbf{x})} g_i(\mathbf{x})\, E_i(\mathbf{x}) \;\bigl(+\; E_{\mathrm{sh}}(\mathbf{x})\bigr),$$

where $E_1, \dots, E_N$ are feed-forward expert networks, $g_i(\mathbf{x})$ are normalized routing weights, $\mathcal{S}(\mathbf{x}) \subseteq \{1, \dots, N\}$ with $|\mathcal{S}(\mathbf{x})| = k$ indexes the per-token selected experts, and an always-active shared expert $E_{\mathrm{sh}}$ may be included for stability (Jiang et al., 6 Feb 2026). The gating is realized through

$$\mathbf{s}(\mathbf{x}) = \mathrm{softmax}(W_r \mathbf{x}),$$

and only the $k$ largest values are retained. Final outputs are weighted sums over active experts plus the shared expert (if present), subject to gate-value scaling for gradient flow (Jiang et al., 6 Feb 2026).
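As a concrete illustration of this forward computation and top-$k$ gating, here is a minimal NumPy sketch; function and variable names are illustrative, not taken from any cited implementation:

```python
import numpy as np

def topk_moe_forward(x, W_r, experts, k, shared_expert=None):
    """Sparse per-token MoE forward pass with top-k routing.

    x             : (T, d) token embeddings
    W_r           : (d, N) router weight matrix
    experts       : list of N callables, each mapping (d,) -> (d,)
    shared_expert : optional always-active callable (d,) -> (d,)
    """
    logits = x @ W_r                           # (T, N) per-token routing logits
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]       # indices of the k largest logits
        # softmax over the selected logits only (renormalized gates)
        g = np.exp(logits[t, top] - logits[t, top].max())
        g /= g.sum()
        for gate, i in zip(g, top):
            out[t] += gate * experts[i](x[t])  # only k experts run per token
        if shared_expert is not None:
            out[t] += shared_expert(x[t])      # always-on shared expert
    return out
```

A handy sanity check: with identity experts the renormalized gates sum to one, so the layer reduces to the identity map regardless of $k$.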
In all recent implementations, this sparse routing occurs per token, per layer, during both training and inference, enabling conditional and efficient parameter utilization (Jin et al., 16 Dec 2025, Zoph et al., 2022, Zhu et al., 29 Sep 2025, Tastan et al., 5 Feb 2026, Wu et al., 11 Aug 2025, Cai et al., 25 Aug 2025). The architecture generalizes dense FFNs (recovered when $N = k = 1$), as well as static mixtures, and supports further enhancements like Top-$p$ routing (Jin et al., 16 Dec 2025), slimmable experts (Tastan et al., 5 Feb 2026), and post-training partitioning (Cai et al., 25 Aug 2025).
2. Routing Strategies and Gating Networks
Sparse-per-token MoE layers employ various token-wise sparse selection policies for expert routing:
- Fixed Top-$k$ Routing: Each token is routed to the $k$ experts with the highest gating values, yielding exactly $k$ nonzero gates per token (Jiang et al., 6 Feb 2026, Zoph et al., 2022, Zhu et al., 29 Sep 2025).
- Dynamic Top-$p$ Routing: For each token, select the minimal set of experts whose softmax probabilities sum to at least a threshold $p$; the number of active experts varies dynamically, and the compute budget is enforced via a proportional-integral (PI) controller (Jin et al., 16 Dec 2025).
- Slice-level and Similarity-aware Variants: Slicing the token embedding and routing each slice to separate experts offers finer granularity (Vejendla, 5 Oct 2025), while incorporating token similarity/attention for more stable, context-aware routing further reduces volatility (Nguyen et al., 1 May 2025).
- Explicit Shared Expert: One always-on expert is present to prevent gradient starvation and stabilize updates (Jiang et al., 6 Feb 2026).
Routing is always realized as a per-token forward pass through a small MLP or linear layer, mapping each token embedding to $N$ routing logits, followed by a sparsifying selection (top-$k$, top-$p$, or slice-level) and, typically, softmax normalization over the selected indices. Auxiliary load-balance losses that encourage uniform expert utilization remain common, but are sometimes unnecessary when mechanisms like a shared expert and gate scaling are used (Jiang et al., 6 Feb 2026, Wu et al., 11 Aug 2025).
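The dynamic top-$p$ selection rule can be sketched as follows. This is an illustrative NumPy version of the generic idea (the PI budget controller is omitted), not the implementation from the cited paper:

```python
import numpy as np

def top_p_select(logits, p):
    """Select the minimal expert set whose softmax mass reaches p.

    Returns (expert indices, renormalized gate values) for one token.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]       # experts by decreasing probability
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p)) + 1  # smallest k with cumulative mass >= p
    k = min(k, logits.size)               # guard against p ~ 1.0 round-off
    chosen = order[:k]
    gates = probs[chosen] / probs[chosen].sum()
    return chosen, gates
```

A peaked router activates a single expert under a moderate $p$, while a flat router activates several; this variability is why the cited design pairs the rule with a PI loop that adjusts the budget to hold average FLOPs constant.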
3. Expert Architectures and Parameter Partitioning
The “expert” modules in a sparse-per-token MoE are typically shallow feed-forward networks (FFNs), implemented as two-layer MLPs with nonlinearities such as SwiGLU, GEGLU, or SiLU (Zhu et al., 29 Sep 2025, Zoph et al., 2022, Vejendla, 5 Oct 2025). Architectures support several extensions:
- Fine-grained Expert Partitioning: A “first enlarge, then sparsify” scheme splits a dense, high-capacity FFN into many fine-grained experts, each assigned a slice of the hidden dimension, promoting both efficiency and expressivity (Jiang et al., 6 Feb 2026).
- Heterogeneous/Adjugate Experts: Groups of experts of different sizes (e.g., “big” and “little” adjugate experts) are used for dynamic, token-adaptive capacity allocation (Wu et al., 11 Aug 2025).
- Slimmable Experts: Each expert supports evaluation at multiple widths via parameter slicing (nested sub-networks), allowing flexible accuracy-compute trade-offs at inference time (Tastan et al., 5 Feb 2026).
- Post-Training Partition and Neuron-Level Reconstruction: Existing experts are partitioned post hoc along the intermediate dimension, and important neurons are identified for computation dropping and neuron-level reconstruction (Cai et al., 25 Aug 2025).
A shared expert—a full-capacity feedforward module always run for every token—is deployed in some designs to ensure every token has a nonzero update path, simplifying training and removing the need for explicit balancing losses (Jiang et al., 6 Feb 2026).
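The parameter-slicing idea behind slimmable experts can be sketched with a toy NumPy class; this is illustrative only (ReLU stands in for the SwiGLU/SiLU activations used in practice), not the cited implementation:

```python
import numpy as np

class SlimmableExpert:
    """Two-layer FFN whose hidden width can be sliced at inference time.

    The first `width` columns of W1 (and rows of W2) form a nested
    sub-network, so a single parameter set serves several compute budgets.
    """
    def __init__(self, d, h, rng):
        self.W1 = rng.standard_normal((d, h)) / np.sqrt(d)
        self.W2 = rng.standard_normal((h, d)) / np.sqrt(h)

    def __call__(self, x, width):
        hidden = np.maximum(x @ self.W1[:, :width], 0.0)  # ReLU on sliced hidden units
        return hidden @ self.W2[:width, :]
```

Evaluating the same expert at half width reuses the leading parameter slice, which is what allows the accuracy-compute trade-off to be chosen per deployment without retraining.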
4. Computational Efficiency and Inference Properties
Sparse-per-token MoE architectures are defined by their ability to scale parameter count while keeping per-token FLOPs and memory bounded:
- FLOP Savings: If the base FFN has $P$ parameters and the MoE splits them evenly among $N$ experts with $k$ active per token, per-token FLOPs drop to roughly $k/N$ of the dense baseline (Jiang et al., 6 Feb 2026, Jin et al., 16 Dec 2025).
- Latency and Memory: Running only a small subset of experts per token during inference yields substantial GPU memory savings (75–80% in large-scale models) and proportional latency reductions (Zhu et al., 29 Sep 2025).
- Load Balancing: Routing and activation histograms show that with careful design—either via balancing losses (Zoph et al., 2022), bias updates (Wu et al., 11 Aug 2025), or a shared expert (Jiang et al., 6 Feb 2026)—experts are utilized uniformly, avoiding underused (cold) experts.
- Structured Sparsity and Hardware Co-Design: Dual-sided structured sparsity (on both weights and routed activations), combined with hardware-aware data layouts and optimized kernels, maximizes achievable speedup on modern accelerators (e.g., NVIDIA Sparse Tensor Cores, SpTCs), supporting a severalfold increase in effective batch size and 1.5–2× speedups in realistic workloads (Wu et al., 13 Mar 2025).
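The FLOP accounting above can be made concrete with a toy helper (illustrative arithmetic only; real layers also pay router and dispatch overhead):

```python
def moe_flops_fraction(n_experts, k_active, shared=False):
    """Per-token FLOPs of a sparse MoE layer as a fraction of the dense
    FFN it was split from, assuming equal-size experts and ignoring
    the (small) router cost."""
    frac = k_active / n_experts
    if shared:  # an always-on shared expert of the same 1/n_experts size
        frac += 1.0 / n_experts
    return frac
```

For example, 8 experts with 2 active spend 25% of the dense FLOPs per token, or 37.5% when an equal-sized shared expert is always on.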
5. Theoretical and Empirical Advantages
Sparse-per-token MoE models deliver several key advantages:
- Scalability: By decoupling parameter count from per-token computation, models up to hundreds of billions of parameters can be trained and deployed while fitting within the runtime/memory envelope of much smaller dense models (Zoph et al., 2022).
- Efficiency–Accuracy Trade-offs: Sophisticated designs (slimmable experts, adjugate groups, per-token dynamic routing) enable a dense spectrum of accuracy/compute trade-offs, supporting both fixed and flexible inference budgets (Tastan et al., 5 Feb 2026, Wu et al., 11 Aug 2025).
- Gradient Flow and Robustness: Gate-value scaling, shared experts, and context-aware routing improve gradient propagation and mitigate expert starvation or collapse without auxiliary balancing losses (Jiang et al., 6 Feb 2026, Wu et al., 11 Aug 2025, Lei et al., 6 Jan 2026).
- Dynamic Capacity Allocation: Stratified and grouping-based designs allow per-token or per-task adaptation of capacity, allocating more compute to “hard” tokens and less to “easy” ones (Xu et al., 2023, Wu et al., 11 Aug 2025).
- Computation Dropping: Partition-and-drop and dual-threshold gating reduce active expert/inference computation per token by ≈25%–50% with negligible (≤0.3%) accuracy drop (Cai et al., 25 Aug 2025).
Empirical results show state-of-the-art performance across NLP, multimodal, diffusion, and recommender pipelines, with performance at or above dense baselines at a fraction of the computational footprint (Jin et al., 16 Dec 2025, Jiang et al., 6 Feb 2026, Zhu et al., 29 Sep 2025).
6. Applications, Generalizations, and Limitations
Sparse-per-token MoE layers are deployed in a range of settings:
- LLMs and Diffusion Models: LLaDA-MoE, Mixtral, and others demonstrate the method’s ability to match or surpass dense models with an order-of-magnitude fewer active parameters per token (Zhu et al., 29 Sep 2025, Jin et al., 16 Dec 2025).
- Industrial Recommendation Systems: TokenMixer-Large with SP-MoE layers delivers 7–15 billion parameter models at tractable inference cost in online deployments (Jiang et al., 6 Feb 2026).
- Multimodal and Audio Models: MoE-Adapters are used to disentangle heterogeneous audio modalities and reduce gradient conflicts (Lei et al., 6 Jan 2026).
- Vision-LLMs, Translation, and Classification: Grouped, stratified, and slice-based MoEs present robustness and specialization advantages (Vejendla, 5 Oct 2025, Xu et al., 2023).
- Structured Sparsity and Hardware-efficient MoE: Systems such as Samoyeds enable dual-sparsity acceleration, supporting efficient large-batch execution on SpTC hardware with minimal loss (Wu et al., 13 Mar 2025).
- Efficiency-Focused Inference and Pruning: DualSparse-MoE allows per-token, runtime computation dropping and load-aware thresholding without retraining (Cai et al., 25 Aug 2025).
Limitations include increased system complexity (operator fusion, expert dispatch, kernel management), need for robust expert balancing at large scale, and dependence on hardware/software support for efficient sparse computation. Theoretical analysis reveals possible trade-offs if expert allocation becomes highly uneven or information flow through shared experts is suboptimal in some domains.
7. Summary Table: Key Sparse-Per-Token MoE Architectures
| Model | Routing Strategy | Key Features/Improvements | Reference |
|---|---|---|---|
| TokenMixer-Large | Top-$k$ per token | Shared expert, gate-value scaling, no balancing loss | (Jiang et al., 6 Feb 2026) |
| DTop-p MoE | Dynamic Top-$p$ + PI | PI control for FLOP budget, layerwise normalization | (Jin et al., 16 Dec 2025) |
| GroveMoE | Group-wise, adjugate | Heterogeneous experts, dynamic capacity | (Wu et al., 11 Aug 2025) |
| SliceMoE | Per-slice Top-$k$ | Slices token embedding, finer granularity, balanced expertise | (Vejendla, 5 Oct 2025) |
| DualSparse-MoE | Gating + thresholding | Post-training partition (major/minor), computation drop | (Cai et al., 25 Aug 2025) |
| MoSE | Top-$k$ + slimmable | Variable-width expert execution at inference | (Tastan et al., 5 Feb 2026) |
| ST-MoE | Top-$k$, stable gating | Large scale, auxiliary losses, span tasks | (Zoph et al., 2022) |
| LLaDA-MoE | Top-$k$ per token | Masked diffusion, large language diffusion models | (Zhu et al., 29 Sep 2025) |
Sparse-per-token MoE architectures constitute the state-of-the-art for scalable, efficient, and adaptive neural sequence modeling, with a diverse research landscape focused on stability, flexibility, deployment efficiency, and theoretical understanding. All major advances referenced above construct MoE layers in which, for each token and each layer, only a sparse, dynamically determined expert subset is activated, yielding high expressivity under strict resource constraints.