ERNIE 5.0: Unified Multimodal Autoregressive Model

Updated 7 February 2026
  • ERNIE 5.0 is a unified multimodal autoregressive model that processes text, image, video, and audio as token groups with cohesive cross-modal interactions.
  • It employs an ultra-sparse mixture-of-experts Transformer backbone with modality-agnostic routing to achieve scalable performance and low computational cost.
  • Its elastic training regime enables extraction of efficient sub-networks and stable reinforcement learning post-training for diverse deployment scenarios.

ERNIE 5.0 is a trillion-parameter, natively autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio. Trained from scratch without late-fusion modules, it casts all modalities as a single next-group-of-tokens prediction task, employs an ultra-sparse mixture-of-experts (MoE) Transformer backbone with modality-agnostic routing, and introduces a novel elastic training regime. ERNIE 5.0 further develops a robust framework for reinforcement learning post-training under ultra-sparse MoE, ensuring state-of-the-art performance and deployment flexibility for diverse resource constraints. It is the first publicly disclosed production-scale autoregressive model supporting both multimodal understanding and generation at the trillion-parameter scale (Wang et al., 4 Feb 2026).

1. Unified Autoregressive Model and Design Rationale

ERNIE 5.0 is architected to natively support all four major modalities—text, image, video, and audio—under a single autoregressive objective. Distinct from late-fusion “backbone + decoder” designs, it converts each modality’s modeling challenge into the uniform task of predicting the next group of tokens, facilitating deep cross-modal token-level interactions and eradicating the “ability seesaw” typically observed in modular approaches.

Key motivations include:

  • Native autoregression for all modalities, eliminating separate decoder architectures.
  • A unified learning signal via next-group-of-tokens prediction, ensuring coherent optimization as scale increases.
  • Ultra-sparse MoE for scalable capacity, with modality-agnostic routing enabling both expert specialization and cross-modal knowledge sharing at less than 3% activation rate.
  • Elastic pre-training to simultaneously train a "super-network" and a spectrum of sub-networks of varying depths, expert counts, and sparsity.
  • Multimodal RL approaches that stabilize post-training for ultra-sparse MoE architectures across diverse modalities.

2. Next-Group-of-Tokens Prediction Objective

ERNIE 5.0’s learning framework treats every instance—text sequence, image, video, or audio—as a sequence of non-overlapping token groups:

Given a sequence $\mathbf{x} = (x_1, \ldots, x_T)$ organized into groups $G_1, G_2, \ldots$, the model optimizes

$$\mathcal{L} = -\sum_{g=1}^{G} \log P(G_g \mid \mathbf{x}_{<g}) = -\sum_{g=1}^{G}\sum_{i=1}^{|G_g|} \log P(x_{t_{g-1}+i} \mid \mathbf{x}_{<t_{g-1}+i}).$$

Token group definitions are modality-specific:

  • Text: Uses multi-token prediction (MTP) and occasional rewinds for parallelism.
  • Vision: Cascading multi-scale tokenizers with bit-quantized codes. Generation employs next-frame-and-scale prediction (NFSP), which is bidirectional within a scale and causal across scales and time.
  • Audio: Hierarchical residual vector quantization (RVQ) generates codec tokens, with next-code prediction (NCP) enabling depth-wise semantic-to-residual modeling via teacher-forced feedback.

This approach enforces a single learning signal for all modalities and enables shared optimization trajectories.
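Because each group's probability factorizes over its tokens, the grouped objective reduces to the ordinary token-level negative log-likelihood, with only the group boundaries being modality-specific. A minimal numpy sketch (group sizes and probabilities are illustrative, not ERNIE's):

```python
import numpy as np

def grouped_nll(token_logps, group_sizes):
    """Next-group-of-tokens loss: sum of per-group negative log-likelihoods,
    where each group's log-probability factorizes over its tokens.
    (Hypothetical helper; real group sizes are modality-specific.)"""
    assert sum(group_sizes) == len(token_logps)
    loss, t = 0.0, 0
    for g in group_sizes:
        loss -= float(np.sum(token_logps[t:t + g]))  # -log P(G_g | x_{<g})
        t += g
    return loss

# Toy per-token log-probabilities for a 6-token sequence split into
# groups of sizes 2, 3, 1 (e.g. text / image-scale / audio-code tokens).
logps = np.log([0.5, 0.25, 0.5, 0.5, 0.5, 0.25])
print(grouped_nll(logps, [2, 3, 1]))  # equals the plain token-level NLL
```

Changing the grouping changes what a "prediction step" means per modality, but not the total loss, which is what lets all four modalities share one optimization signal.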

3. Ultra-Sparse Mixture-of-Experts Architecture

The backbone utilizes a Transformer stack in which every few layers include an MoE block instead of a standard feed-forward network. Defining features:

  • Modality-agnostic expert routing: A linear projection computes gating scores for each token’s hidden state $h \in \mathbb{R}^d$:

$$s = W_g h + b_g, \qquad \pi = \operatorname{softmax}(s) \in \mathbb{R}^E$$

Tokens are dispatched to their top-$k$ experts ($k = 2$ in typical configurations).

  • Ultra-sparse activation: Fewer than 3% of the model’s parameters are activated per token, significantly reducing FLOPs.
  • Auxiliary-loss-free balancing: Expert load balancing is achieved via direct gate bias updates without extra loss terms or hyperparameters.
  • Sparse expert aggregation: Each token receives outputs only from its $k$ selected experts:

$$\operatorname{MoE}(h) = \sum_{j \in \operatorname{TopK}(\pi)} \pi_j \, E_j(h)$$

where each $E_j$ is a 2-layer MLP expert.

The architecture enables scalable and efficient multimodal capacity sharing while maintaining low activation and computational loads.
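The routing and aggregation steps above can be sketched in a few lines of numpy; all sizes, the ReLU MLP experts, and the initialization are illustrative placeholders, not ERNIE 5.0's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2                      # hidden size, expert count, top-k

W_g, b_g = rng.normal(size=(E, d)), np.zeros(E)   # gating projection
experts = [(rng.normal(size=(d, 4 * d)) / np.sqrt(d),
            rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)) for _ in range(E)]

def moe(h):
    """Top-k sparse MoE layer (toy shapes)."""
    s = W_g @ h + b_g                   # gating scores; b_g is the bias that
                                        # auxiliary-loss-free balancing adjusts
    pi = np.exp(s - s.max())
    pi /= pi.sum()                      # softmax over experts
    top = np.argsort(pi)[-k:]           # dispatch to the top-k experts only
    out = np.zeros(d)
    for j in top:                       # sparse aggregation: sum of pi_j * E_j(h)
        W1, W2 = experts[j]
        out += pi[j] * (np.maximum(h @ W1, 0.0) @ W2)  # 2-layer ReLU MLP expert
    return out

y = moe(rng.normal(size=d))
print(y.shape)  # (8,)
```

Note that only $k$ of the $E$ expert MLPs run per token; the rest of the parameters sit idle, which is where the sub-3% activation rate comes from at scale.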

4. Elastic Training and Extractable Sub-networks

ERNIE 5.0 introduces an elastic training procedure that trains a “super-network” from which a family of sub-models occupying different points in the architecture space can be extracted post-training without further tuning. For each mini-batch, the training pipeline samples:

  1. Depth $L'$: $L' \in [L_{\min}, L_{\max}]$ (75% full, 25% random shallower).
  2. Expert count $E'$: $E' \in \{E_{\min}, E_{\max}\}$ (80% full, 20% half).
  3. Routing sparsity $k'$: $k' \in [k_{\min}, k_{\max}]$ (80% default, 20% lower).

The corresponding sampled sub-network is trained for the batch. Post-training, any configuration $\mathcal{M}(L', E', k')$ can be selected directly.
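The per-mini-batch sampling loop can be sketched as follows; `L_max`, `E_full`, `k_default`, and the shallow-depth range are hypothetical placeholders, since the source states only the sampling probabilities:

```python
import random

def sample_elastic_config(L_max=54, E_full=64, k_default=8,
                          rng=random.Random(0)):
    """Sample one sub-network configuration (L', E', k') per mini-batch,
    following the stated probabilities. All sizes are placeholder values."""
    L = L_max if rng.random() < 0.75 else rng.randint(L_max // 2, L_max - 1)
    E = E_full if rng.random() < 0.80 else E_full // 2
    k = k_default if rng.random() < 0.80 else rng.randint(1, k_default - 1)
    return L, E, k

configs = [sample_elastic_config(rng=random.Random(i)) for i in range(1000)]
full = sum(1 for L, E, k in configs if L == 54) / len(configs)
print(full)  # fraction of full-depth batches (0.75 in expectation)
```

Each sampled configuration trains the corresponding slice of the super-network's weights, so every extractable sub-model receives gradient updates during pre-training.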

Empirical ablation indicates:

| Parameter Varied | Reference Loss | Altered Loss | Performance Impact |
|---|---|---|---|
| Full depth vs. 12L | 1.941 | 2.137 | Shallower architectures slightly worse |
| Full vs. half experts | 1.957 | 2.218 | Reduced expert width modestly worse |
| Top-$k$ 4/2/1 | 1.971 | 2.003 / 2.175 | Lower sparsity degrades gracefully; 0.15% loss at inference with 25% $k$ |

Fully elastic sub-models with 53.7% of the activated compute and 35.8% of the parameters achieve 99.5% of the full model’s average performance on text and vision tasks.

5. Reinforcement Learning Techniques for Multimodal Post-Training

After supervised fine-tuning, ERNIE 5.0 is further optimized with Unified Multimodal RL (UMRL), which makes RL tractable on ultra-sparse MoE backbones. The pipeline incorporates:

  • Unbiased Replay Buffer (U-RB): Extends partial rollouts, enforcing data ordering for efficient and unbiased sequence sampling.
  • Multi-granularity Importance Sampling Clipping (MISC): Based on GRPO/GSPO, applies per-token double-sided IS clipping to avoid entropy collapse and stabilize policy optimization:

$$s(y) = \exp\!\left(\frac{1}{|y|}\sum_{j} \ln \frac{\pi_{\text{train}}(y_j \mid \cdot)}{\pi_{\text{old}}(y_j \mid \cdot)}\right)$$

Out-of-bounds ratios are masked.

  • Well-learned Positive Sample Mask (WPSM): Tracks success rate and entropy; once a task is “mastered,” positive gradients are masked to focus on harder queries.
  • Adaptive Hint-based RL (AHRL): Prepends partial chain-of-thought hints for sparse-reward problems, with hint length decreasing as training progresses, accelerating convergence on challenging reasoning tasks.
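The sequence-level importance ratio used by MISC, and its double-sided mask, can be sketched as below; the clipping range `eps` is an illustrative value, not the paper's, and the per-token variant of the clipping is omitted:

```python
import numpy as np

def sequence_is_ratio(logp_train, logp_old):
    """Sequence-level importance ratio s(y): the geometric mean of the
    per-token ratios, as in the GSPO-style formulation."""
    return float(np.exp(np.mean(logp_train - logp_old)))

def misc_mask(logp_train, logp_old, eps=0.2):
    """Double-sided IS clipping: keep a sequence only if its ratio stays
    within [1 - eps, 1 + eps]; out-of-bounds ratios are masked.
    (eps is an illustrative clipping range.)"""
    s = sequence_is_ratio(logp_train, logp_old)
    return (1 - eps) <= s <= (1 + eps)

logp_old = np.log([0.5, 0.5, 0.5])
print(misc_mask(np.log([0.55, 0.5, 0.5]), logp_old))  # small drift: kept
print(misc_mask(np.log([0.9, 0.9, 0.9]), logp_old))   # large drift: masked
```

Masking rather than hard-clipping the gradient keeps the policy update unbiased on the retained sequences, which is what stabilizes training against entropy collapse.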

This strategy ensures stable and efficient optimization for unified multimodal objectives.

6. Benchmarks and Empirical Evaluation

ERNIE 5.0 achieves strong and balanced multimodal performance both before and after RL post-training:

| Modality / Task | Notable Benchmarks | Performance |
|---|---|---|
| Text | MMLU, BBH, HotPotQA, HumanEval+, LiveCodeBench, MMMLU, INCLUDE | Outperforms DeepSeek V3.2 and Kimi K2 (pre-train); matches or exceeds GPT-5 High and Gemini 3-Pro post-train |
| Vision/Video | ChartQA, DocVQA, OCRBench, SimpleVQA, CountBench, VideoMME, MMVU, GenEval, VBench | Comparable to Qwen-Image, Veo3; strong multimodal alignment |
| Audio | AISHELL-1, LibriSpeech, VoiceBench, MMAU, TUT2017, CochlScene, SEED-TTS | ASR WER: 0.31% (AISHELL-1), 1.16%/2.61% (LibriSpeech); competitive with Whisper-based systems and Qwen-3-Omni |
| Ablations | Elasticity, routing | Modality-agnostic routing yields 1–2% gains; elastic $k$ yields 15% decode speedup with <0.2% loss |

Shallow sub-networks (60% depth) incur less than 10% average performance loss while halving latency.

7. Routing Structure Visualization and Insights

Visualization of expert routing across layers yields the following findings:

  • Expert activation histograms: Reveal a small core set of experts universally shared, a medium set for text + audio, and extensive tails for modality-specialized experts, particularly in vision.
  • Intersection-over-Union (IoU): Text–audio IoU is highest in early layers (semantic overlap), image–video IoU peaks in mid-depths (spatial-feature sharing), and overall semantic alignment increases with depth.
  • Normalized entropy ($\mathrm{NE}$): Text routing is consistently balanced ($\mathrm{NE} \approx 0.95$). Visual and audio routing show alternating specialization and re-integration phases as depth increases.
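Normalized entropy here is the routing distribution's entropy divided by its maximum value $\log E$, so 1.0 means perfectly balanced routing and values near 0 mean one expert dominates. A small sketch with hypothetical routing distributions:

```python
import numpy as np

def normalized_entropy(pi):
    """NE = H(pi) / log(E): 1.0 for uniform routing, 0.0 for one-hot."""
    pi = np.asarray(pi, dtype=float)
    H = -np.sum(np.where(pi > 0, pi * np.log(pi), 0.0))
    return float(H / np.log(len(pi)))

print(round(normalized_entropy([0.25] * 4), 2))                # 1.0 (balanced)
print(round(normalized_entropy([0.97, 0.01, 0.01, 0.01]), 2))  # 0.12 (specialized)
```

Tracking this quantity per layer and per modality is what reveals the balanced-text vs. specialize-then-reintegrate patterns described above.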

Best practices emerging from the analysis include the sufficiency of modality-agnostic routing (no manual partitioning required), organic emergence of shared experts for generic reasoning, elastic training as a regularizer, and the efficacy of “corrupt-and-correct” training within NFSP to mitigate error accumulation during long visual generation runs.

ERNIE 5.0 establishes that a unified autoregressive Transformer, augmented with ultra-sparse MoE, elastic “once-for-all” training, and stabilized RL, is capable of state-of-the-art performance across all modalities and tasks, while supporting adaptable deployments under diverse operational constraints (Wang et al., 4 Feb 2026).
