
OmniMoE: Scalable Sparse Mixture-of-Experts

Updated 19 February 2026
  • OmniMoE architectures are sparse Mixture-of-Experts neural models that generalize expert definitions with atomic, modality-specific, and coarse-grained experts.
  • They employ diverse routing strategies, including shared routers and Cartesian product and top-p dynamic gating, to optimize compute efficiency and load balance expert activation.
  • Designed for large-scale multimodal tasks, OmniMoE integrates system-algorithm co-design and hardware optimizations to achieve significant speedups and scalability.

OmniMoE Architecture refers to a class of sparse Mixture-of-Experts (MoE) neural architectures that generalize and extend traditional MoE frameworks by introducing design innovations for expert granularity, routing coordination, cross-modal unification, and system-algorithm co-design. These architectures are applied in diverse domains, including large autoregressive LLMs, multimodal generative and perception tasks, speech recognition, and scientific ML. Notably, “OmniMoE” also serves as a generic umbrella for a series of concretely named architectures from several independent research lines (e.g., “OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale,” “Uni-MoE-2.0-Omni,” “Ming-Flash-Omni,” “Omni-Router Transformer,” and others), sharing core principles with model- and system-specific specializations (Shi et al., 5 Feb 2026, Li et al., 16 Nov 2025, AI et al., 28 Oct 2025, Gu et al., 8 Jul 2025, Li et al., 2024, Ma et al., 4 Aug 2025, 2502.01074).

1. Granularity of Experts and Core Layer Organization

OmniMoE architectures generalize the expert definition to include traditional coarse-grained experts (full multi-layer FFNs), modality-specific dense and sparse experts, atomic experts (single vector-pair transforms), and anchor/reconcile expert mechanisms.

  • Atomic Experts: In the most granular instantiation, each expert is represented as a pair of vectors $w^{\mathrm{in}}_i, w^{\mathrm{out}}_i \in \mathbb{R}^d$, and the atomic expert applies $E_i(x) = \sigma(x (w^{\mathrm{in}}_i)^\top) w^{\mathrm{out}}_i$ with $\sigma$ a nonlinearity, such as SwiGLU. This enables scaling the number of experts into the millions for extremely fine specialization (Shi et al., 5 Feb 2026).
  • Coarse-Grained and Modality Experts: Traditional MoE uses per-layer banks of $N$ full FFN experts, with variants including large routed, small shared, and “null” experts for computational skipping (Li et al., 16 Nov 2025, AI et al., 28 Oct 2025).
  • Anchor-and-Reconcile Experts: In some MoE designs for scientific/molecular domains, each MoE layer in the decoder comprises $N$ reconcile experts (task-specialized) and a single anchor expert (task-agnostic, always active), with the anchor aggregating the global representation (2502.01074).

Sparsity is induced by routing only a few experts per token (often $k \ll N$), reducing both compute and active parameter cost per forward pass. Empirical settings for $N$ (number of experts) and $k$ (number active per token) vary, with $N=128$ and $k=2$ typical in ultralarge models, while atomic expert models can use $N$ in the millions with $k=8$ (Shi et al., 5 Feb 2026, AI et al., 28 Oct 2025).
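The atomic-expert computation above is simple enough to sketch directly. The following NumPy fragment (a minimal illustration, not the authors' implementation; SiLU stands in for the SwiGLU gating, and all names are hypothetical) applies a sum of $k$ routed atomic experts to a single token:

```python
import numpy as np

def atomic_expert_forward(x, W_in, W_out, idx):
    """Apply a sum of k atomic experts to one token.

    x     : (d,) token representation
    W_in  : (N, d) input vectors w_i^in, one row per atomic expert
    W_out : (N, d) output vectors w_i^out
    idx   : (k,) indices of the routed experts, with k << N

    Each atomic expert computes E_i(x) = sigma(x . w_i^in) * w_i^out;
    SiLU is used here as a stand-in for SwiGLU's gated nonlinearity.
    """
    def silu(z):
        return z / (1.0 + np.exp(-z))

    acts = silu(W_in[idx] @ x)   # (k,) one scalar activation per routed expert
    return acts @ W_out[idx]     # (d,) sum of activation-weighted output vectors

rng = np.random.default_rng(0)
d, N, k = 16, 1024, 8
x = rng.standard_normal(d)
W_in = rng.standard_normal((N, d))
W_out = rng.standard_normal((N, d))
y = atomic_expert_forward(x, W_in, W_out, np.arange(k))
print(y.shape)  # (16,)
```

Because each expert touches only $2d$ parameters, the active parameter count per token is $2kd$ regardless of how large $N$ grows.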

2. Routing Mechanisms and Coordination

A defining feature of OmniMoE is the diversity and coordination in routing strategies:

  • Shared Router Weights: The Omni-Router Transformer ties the router matrices of all MoE layers, so that $P^l = \mathrm{softmax}(X^l W^{\mathrm{shared}})$ for all $l$, which enforces inter-layer consistency and enables specialist “pipelines” across depth—a marked difference from per-layer independent routers in Switch Transformers (Gu et al., 8 Jul 2025).
  • Cartesian Product Routing: Atomically fine OmniMoE variants avoid the $O(Nd)$ complexity of standard routing by factorizing the expert index into a pair $(N_r, N_c)$ and learning two projections, so that $S_{ij} = p_r[i] + p_c[j]$. The top-$k$ experts are selected from an $N_r \times N_c$ grid, yielding $O(d\sqrt{N})$ complexity (Shi et al., 5 Feb 2026).
  • Top-P Dynamic Capacity Routing: In advanced multimodal models, each token selects a minimal set of routed experts whose cumulative softmax mass exceeds a fixed threshold $P$ (“Top-P”); this varies $k$ per token and allows dynamic, data-dependent compute allocation (Li et al., 16 Nov 2025).
  • Expert Role Masking and Null Experts: Some architectures intermix “null” experts into the routed set so that tokens can explicitly choose to skip computation, controlled by router output (Li et al., 16 Nov 2025).
  • Auxiliary Load-Balance Losses: Practically every OmniMoE system includes some form of Switch-style or improved load-balancing loss: usually $\mathcal{L}_{\mathrm{load}} = N \sum_j f_j \rho_j$, or the quadratic load loss $\mathcal{L}_{\mathrm{balance}} = \mathbb{E}_{\mathrm{tokens}}[p^2]$, counteracting expert collapse and ensuring diverse usage (Gu et al., 8 Jul 2025, AI et al., 28 Oct 2025, 2502.01074).
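The Cartesian product routing described above can be sketched in a few lines. This NumPy fragment (an illustrative sketch under the stated factorization, not the paper's kernel; all names are hypothetical) scores only $N_r + N_c$ projections per token, then takes the top-$k$ over the implied grid:

```python
import numpy as np

def cartesian_route(x, P_r, P_c, k):
    """Factorized top-k routing over an N_r x N_c expert grid.

    x   : (d,) token representation
    P_r : (N_r, d) row-router projection
    P_c : (N_c, d) column-router projection

    The score of expert (i, j) is S_ij = p_r[i] + p_c[j], so scoring costs
    O(d * (N_r + N_c)) = O(d * sqrt(N)) instead of O(d * N).
    """
    p_r = P_r @ x                         # (N_r,) row scores
    p_c = P_c @ x                         # (N_c,) column scores
    S = p_r[:, None] + p_c[None, :]       # full grid materialized for this small demo
    flat = np.argsort(S, axis=None)[-k:]  # indices of the k largest grid entries
    rows, cols = np.unravel_index(flat, S.shape)
    return list(zip(rows.tolist(), cols.tolist()))

rng = np.random.default_rng(1)
d, N_r, N_c = 8, 32, 32  # N = 1024 experts addressed by two 32-way routers
experts = cartesian_route(rng.standard_normal(d),
                          rng.standard_normal((N_r, d)),
                          rng.standard_normal((N_c, d)), k=8)
print(len(experts))  # 8
```

A production kernel would avoid materializing the full grid (e.g., by combining per-factor top candidates), but the additive score structure is the essential point.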

Router parameterization typically involves a learned $W \in \mathbb{R}^{d \times N}$ or its factorized variants; gating is performed via softmax and top-$k$ or top-$p$ assignment at the token level.
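Top-P dynamic capacity routing follows directly from this gating formulation. The sketch below (a minimal illustration, assuming per-token router logits as input; the threshold value is arbitrary) selects the smallest expert set whose cumulative softmax mass exceeds $P$, so peaked router distributions activate few experts and flat ones activate many:

```python
import numpy as np

def top_p_route(logits, P=0.7):
    """Select the minimal set of experts whose cumulative softmax
    probability exceeds the threshold P (per-token dynamic k)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]       # experts by descending router mass
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, P) + 1)  # smallest k with cumulative mass > P
    return order[:k], probs[order[:k]]

# A peaked router distribution routes to few experts; a flat one to many.
peaked = np.array([4.0, 0.1, 0.0, -0.2, -1.0])
flat = np.zeros(5)
print(len(top_p_route(peaked)[0]), len(top_p_route(flat)[0]))
```

This is the mechanism by which compute becomes data-dependent: "easy" tokens with confident routing consume fewer expert forward passes.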

3. Multimodal and Cross-Domain Integration

OmniMoE architectures serve as the unifying backbone for large omnimodal LLMs, cross-modal generative models, and molecular science generalists.

4. System-Algorithm Co-Design and Hardware-Efficient Execution

Scaling OmniMoE architectures in practice requires explicit co-design of software kernels and parallelization strategies:

  • Expert-Centric Scheduling: The “expert-centric” execution model in atomic expert OmniMoE (instead of the traditional token-centric) aggregates all token assignments for each expert, processes them in large GEMMs per expert, and thereby minimizes memory bandwidth and maximizes compute efficiency (Shi et al., 5 Feb 2026).
  • Parallelisms: Deployment leverages data parallelism (DP), expert parallelism (EP), and sequence parallelism (SP, e.g., DeepSpeed Ulysses) so the architecture scales to 128+ GPUs (FSDP+SP+EP) and long context lengths (up to 160k tokens) (Ma et al., 4 Aug 2025).
  • Hardware Optimizations: Use of Triton kernels for blockwise top-k routing, GPU thread block tiling for Cartesian routers, and specialized sorting for grouping token-expert pairs are common (Shi et al., 5 Feb 2026).
  • Efficient MoE Layer Integration: In multi-modal backbones, only a subset of FFN layers are replaced by MoE blocks (e.g., every other block), and expert weight sharding reduces overall active memory footprint (Ma et al., 4 Aug 2025, AI et al., 28 Oct 2025).
  • Empirical Performance: Atomic expert OmniMoE achieves end-to-end speedups of up to 10.9× over prior fine-grained MoEs (6.7 ms vs 73 ms inference latency for 4k tokens), supporting state-of-the-art throughput and parallel scaling behavior (Shi et al., 5 Feb 2026).
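The expert-centric scheduling idea from the first bullet can be sketched as follows (a simplified NumPy illustration with top-1 routing and a ReLU FFN standing in for the real expert; not the Triton kernels described above, and all names are hypothetical). Tokens are sorted by expert id so that each expert processes its entire batch in one large matmul:

```python
import numpy as np

def expert_centric_dispatch(X, assign, W_in, W_out):
    """Expert-centric execution: group token rows by expert, run one
    batched matmul per expert, and scatter the results back.

    X            : (T, d) token batch
    assign       : (T,) expert id per token (k=1 for simplicity)
    W_in, W_out  : (N, d, h) and (N, h, d) per-expert FFN weights
    """
    Y = np.zeros_like(X)
    order = np.argsort(assign, kind="stable")  # token ids sorted by expert
    sorted_assign = assign[order]
    N = W_in.shape[0]
    bounds = np.searchsorted(sorted_assign, np.arange(N + 1))
    for e in range(N):
        tok = order[bounds[e]:bounds[e + 1]]   # all tokens routed to expert e
        if tok.size:
            H = np.maximum(X[tok] @ W_in[e], 0.0)  # one large GEMM per expert
            Y[tok] = H @ W_out[e]
    return Y

rng = np.random.default_rng(2)
T, d, h, N = 64, 8, 16, 4
X = rng.standard_normal((T, d))
assign = rng.integers(0, N, size=T)
Y = expert_centric_dispatch(X, assign, rng.standard_normal((N, d, h)),
                            rng.standard_normal((N, h, d)))
print(Y.shape)  # (64, 8)
```

The contrast with token-centric execution is that the inner loop runs over experts, not tokens, so each weight matrix is loaded once per batch rather than once per token.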

5. Training Protocols, Specialization, and Convergence

OmniMoE architectures rely on progressive training strategies and auxiliary objectives to induce both global and expert-level specialization:

  • Progressive Multi-Stage Training: Typical procedure involves pretraining connectors and encoders for cross-modality alignment; modality-specific expert warm-up; then joint MoE fine-tuning on mixed-modality instructions; often finished with RL or preference optimization (GSPO, DPO) for reasoning chains and stable expert allocation (Li et al., 16 Nov 2025, Li et al., 2024).
  • Tricks for Routing Stability: Use of null experts, data-balanced annealing, warm initialization from specialized dense models, and adaptive gradient estimators (ODE-based) for router gradient flow are prevalent (Li et al., 16 Nov 2025).
  • Auxiliary Losses and Regularization: Load-balancing, prior-importance, and variance losses are added to the main task objectives to evenly distribute expert workload and prevent expert collapse (Gu et al., 8 Jul 2025, AI et al., 28 Oct 2025, 2502.01074).
  • Gradient Stabilization Modules: For instruction-tuned scientific models, adaptive gradient scaling is introduced for PEFT adapters to control update magnitudes in multitask, multimodal optimization, e.g., scaling LoRA updates by a learnable $y_e = (p \cdot a_e + b_e)/r$ (2502.01074).
  • Empirical Scaling Laws: OmniMoE models demonstrate parameter and data scaling laws in molecular tasks, maintaining performance gains as tasks, data, or model size increase up to practical limits (2502.01074).
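The load-balancing objective referenced throughout Sections 2 and 5 is straightforward to compute. This sketch (an illustrative NumPy version of the Switch-style formula $\mathcal{L}_{\mathrm{load}} = N \sum_j f_j \rho_j$; the function name is hypothetical) shows that perfectly uniform routing attains the minimum value of 1:

```python
import numpy as np

def load_balance_loss(router_probs, expert_ids, N):
    """Switch-style auxiliary loss L_load = N * sum_j f_j * rho_j, where
    f_j is the fraction of tokens routed to expert j and rho_j is the
    mean router probability assigned to expert j over the batch.

    router_probs : (T, N) softmax router outputs
    expert_ids   : (T,) top-1 expert chosen per token
    """
    T = router_probs.shape[0]
    f = np.bincount(expert_ids, minlength=N) / T  # fraction routed per expert
    rho = router_probs.mean(axis=0)               # mean router mass per expert
    return N * float(np.sum(f * rho))

# Uniform routing over N experts gives f_j = rho_j = 1/N, hence loss = 1.
T, N = 8, 4
uniform = np.full((T, N), 1.0 / N)
ids = np.arange(T) % N
print(load_balance_loss(uniform, ids, N))  # 1.0
```

Added with a small coefficient to the main objective, this term penalizes any expert that attracts disproportionate router mass, counteracting the expert-collapse failure mode described above.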

6. Empirical Results and Comparative Evaluations

Multiple independent benchmarks establish the consistent empirical advantages of OmniMoE architectures:

  • Speech Recognition: The Omni-Router Transformer achieves a mean WER of 10.4% over 10 out-of-domain test sets (2 experts, 140M active parameters), outperforming dense (12.7%) and Switch-Transformer (11.7%) baselines (Gu et al., 8 Jul 2025).
  • Atomic Expert MoE: A 1.7B-activated-parameter OmniMoE attains 50.9% zero-shot accuracy across seven benchmarks, outperforming DeepSeekMoE (50.2%) and PEER (48.9%), with 10.9× lower inference latency than PEER (Shi et al., 5 Feb 2026).
  • Multimodal LLMs: Uni-MoE-2.0-Omni and Ming-Flash-Omni exhibit strong performance gains in video understanding (+7%), omnimodal tasks (+7%), AV reasoning (+4%), and context-aware ASR, reducing WER and supporting unified segmentation, generation, and editing (Li et al., 16 Nov 2025, AI et al., 28 Oct 2025).
  • Scalability: VeOmni (Qwen3-MoE 30B) scales to 160k context windows and delivers up to 2,853 tokens/sec/GPU (FSDP+SP+EP), exceeding 1,000 tokens/sec/GPU at the largest sequence lengths (Ma et al., 4 Aug 2025).
  • Molecular Generalization: OmniMoE in Omni-Mol demonstrates robust multi-task performance, 1.78× throughput improvement, and 41% FLOPs reduction versus non-MoE baselines, as well as empirical scaling benefits (2502.01074).

7. Architectural Variants and Open Issues

OmniMoE is an umbrella for a broad range of innovations:

| Variant | Key Innovations | Domains/Benchmarks |
| --- | --- | --- |
| Atomic Experts/MoE | Vector-level experts, Cartesian router, expert-centric scheduling | Language and reasoning tasks |
| Omni-Router | Shared routing weights, stronger inter-layer correlation | ASR (large pseudo-labeled corpora) |
| Dynamic-Capacity MoE | Top-P adaptive gating, routed/shared/null experts, 3D RoPE | Multimodal generative LLMs |
| Anchor-and-Reconcile | Always-on anchor expert, conflict-resilient specialization | Molecular multitask learning |
| Ling-Flash/Ming-Flash | Sparse ultralarge MoE, multimodal fusion, SOTA generative segmentation | Multimodal AGI, editing, ASR |

Open issues include the optimal balance between expert granularity and routing/communication cost, stability of expert specialization under domain shifts, and the practical integration of these innovations for emerging ultra-long context and high-modality systems.


OmniMoE architectures represent a convergence of advances in sparse expert modeling, expert routing coordination, hardware co-design, and universal multimodal integration, yielding a new regime of scalable, efficient, and robust neural networks for challenging open-domain AI tasks (Shi et al., 5 Feb 2026, Li et al., 16 Nov 2025, AI et al., 28 Oct 2025, Gu et al., 8 Jul 2025, Li et al., 2024, Ma et al., 4 Aug 2025, 2502.01074).
