ERNIE 5.0: Unified Multimodal Autoregressive Model
- ERNIE 5.0 is a unified multimodal autoregressive model that processes text, image, video, and audio as token groups with cohesive cross-modal interactions.
- It employs an ultra-sparse mixture-of-experts Transformer backbone with modality-agnostic routing to achieve scalable performance and low computational cost.
- Its elastic training regime enables extraction of efficient sub-networks and stable reinforcement learning post-training for diverse deployment scenarios.
ERNIE 5.0 is a trillion-parameter, natively autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio. Trained from scratch without late-fusion modules, it casts all modalities as a single next-group-of-tokens prediction task, employs an ultra-sparse mixture-of-experts (MoE) Transformer backbone with modality-agnostic routing, and introduces a novel elastic training regime. ERNIE 5.0 further develops a robust framework for reinforcement learning post-training under ultra-sparse MoE, ensuring state-of-the-art performance and deployment flexibility for diverse resource constraints. It is the first publicly disclosed production-scale autoregressive model supporting both multimodal understanding and generation at the trillion-parameter scale (Wang et al., 4 Feb 2026).
1. Unified Autoregressive Model and Design Rationale
ERNIE 5.0 is architected to natively support all four major modalities (text, image, video, and audio) under a single autoregressive objective. Unlike late-fusion "backbone + decoder" designs, it converts each modality's modeling problem into the uniform task of predicting the next group of tokens, enabling deep cross-modal token-level interactions and eliminating the "ability seesaw" typically observed in modular approaches.
Key motivations include:
- Native autoregression for all modalities, eliminating separate decoder architectures.
- A unified learning signal via next-group-of-tokens prediction, ensuring coherent optimization as scale increases.
- Ultra-sparse MoE for scalable capacity, with modality-agnostic routing enabling both expert specialization and cross-modal knowledge sharing at less than 3% activation rate.
- Elastic pre-training to simultaneously train a "super-network" and a spectrum of sub-networks of varying depths, expert counts, and sparsity.
- Multimodal RL approaches that stabilize post-training for ultra-sparse MoE architectures across diverse modalities.
2. Next-Group-of-Tokens Prediction Objective
ERNIE 5.0’s learning framework treats every instance—text sequence, image, video, or audio—as a sequence of non-overlapping token groups:
Given a token sequence $x = (x_1, \ldots, x_T)$ organized into non-overlapping groups $g_1, \ldots, g_N$, the model optimizes the group-level autoregressive objective

$$\mathcal{L} = -\sum_{n=1}^{N} \log p_\theta\left(g_n \mid g_{<n}\right).$$
Token group definitions are modality-specific:
- Text: Uses multi-token prediction (MTP) and occasional rewinds for parallelism.
- Vision: Cascading multi-scale tokenizers with bit-quantized codes. Generation employs next-frame-and-scale prediction (NFSP), causing intra-scale bidirectionality and inter-scale/temporal causality.
- Audio: Hierarchical residual vector quantization (RVQ) generates codec tokens, with next-code prediction (NCP) enabling depth-wise semantic-to-residual modeling via teacher-forced feedback.
This approach enforces a single learning signal for all modalities and enables shared optimization trajectories.
3. Ultra-Sparse Mixture-of-Experts Architecture
The backbone utilizes a Transformer stack in which every few layers include an MoE block instead of a standard feed-forward network. Defining features:
- Modality-agnostic expert routing: A linear projection $W_r$ computes gating scores from each token's hidden state $h_t$:

$$g = \mathrm{softmax}(W_r h_t),$$

and each token is dispatched to its top-$k$ experts.
- Ultra-sparse activation: Each token activates fewer than 3% of the experts, significantly reducing FLOPs.
- Auxiliary-loss-free balancing: Expert load balancing is achieved via direct gate bias updates without extra loss terms or hyperparameters.
- Sparse expert aggregation: Each token receives outputs only from its selected experts:

$$y_t = \sum_{i \in \mathrm{TopK}(g)} g_i \, E_i(h_t),$$

where each $E_i$ is a 2-layer MLP expert.
The architecture enables scalable and efficient multimodal capacity sharing while maintaining low activation and computational loads.
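A minimal single-token sketch of such a block, assuming softmax gating, bias-steered top-$k$ selection, and toy tanh "experts" standing in for the 2-layer MLPs. All shapes, values, and the expert form are illustrative, not the production configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(h, W_r, experts, gate_bias, k=2):
    """One sparse MoE block applied to a single token (illustrative sketch).

    h         : token hidden state, shape (d,)
    W_r       : router projection, shape (num_experts, d)
    experts   : list of callables, toy stand-ins for 2-layer MLP experts
    gate_bias : per-expert bias used only for top-k selection; updating it
                directly balances load without an auxiliary loss term
    """
    scores = softmax(W_r @ h)                  # modality-agnostic gates g_i
    top = np.argsort(scores + gate_bias)[-k:]  # bias steers selection only
    weights = scores[top] / scores[top].sum()  # renormalize over chosen experts
    return sum(w * experts[i](h) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_exp = 8, 16
W_r = rng.normal(size=(n_exp, d))
# Distinct weight matrix per expert via the default-argument trick.
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(W @ x) for _ in range(n_exp)]
y = moe_forward(rng.normal(size=d), W_r, experts, gate_bias=np.zeros(n_exp), k=2)
```

Note that the gate bias enters only the top-$k$ selection, not the mixture weights, which mirrors the auxiliary-loss-free balancing idea: load can be steered without distorting the learned gate values.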
4. Elastic Training and Extractable Sub-networks
ERNIE 5.0 introduces an elastic training procedure that trains a "super-network" from which a family of sub-models occupying different points in the architecture space can be extracted after training, without any further tuning. For each mini-batch, the training pipeline samples:
- Depth $L$: full depth with probability 0.75; a randomly sampled shallower depth with probability 0.25.
- Expert count $E$: the full expert count with probability 0.80; half the experts with probability 0.20.
- Routing sparsity (top-$k$): the default $k$ with probability 0.80; a lower $k$ with probability 0.20.
The corresponding sampled sub-network is trained for the batch. Post-training, any configuration can be selected directly.
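The per-batch sampling can be sketched as follows; the exact distributions over shallower depths and reduced $k$, and the concrete values `full_depth=60`, `full_experts=64`, `full_topk=8`, are illustrative assumptions rather than the paper's settings:

```python
import random

def sample_subnet(full_depth, full_experts, full_topk, rng=random):
    """Sample one sub-network configuration per mini-batch (elastic training).

    Probabilities follow the schedule described above; the uniform range
    for the shallower depth and the reduced-k choices are assumptions.
    """
    depth = full_depth if rng.random() < 0.75 else rng.randint(full_depth // 2, full_depth - 1)
    experts = full_experts if rng.random() < 0.80 else full_experts // 2
    top_k = full_topk if rng.random() < 0.80 else rng.choice([1, 2, full_topk // 2])
    return {"depth": depth, "experts": experts, "top_k": top_k}

# Hypothetical full-network configuration for illustration only.
cfg = sample_subnet(full_depth=60, full_experts=64, full_topk=8)
```

Because every sampled configuration shares weights with the super-network, any of these configurations can be extracted at deployment time by simply truncating depth, expert count, or routing sparsity to the sampled values.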
Empirical ablation indicates:
| Parameter Varied | Reference Loss | Altered Loss | Performance Impact |
|---|---|---|---|
| Full depth vs. 12L | 1.941 | 2.137 | Shallower architectures slightly worse |
| Full vs. half experts | 1.957 | 2.218 | Reduced expert width modestly worse |
| Top-$k$ = 4/2/1 | 1.971 / 2.003 / 2.175 | – | Loss degrades gracefully as $k$ decreases; 0.15% loss at inference with 25% |
Fully elastic sub-models with 53.7% activation and 35.8% parameters achieve 99.5% average performance of the full model on text and vision tasks.
5. Reinforcement Learning Techniques for Multimodal Post-Training
After supervised fine-tuning, ERNIE 5.0 is further optimized with Unified Multimodal RL (UMRL), which makes RL tractable on ultra-sparse MoE backbones. The pipeline incorporates:
- Unbiased Replay Buffer (U-RB): Extends partial rollouts, enforcing data ordering for efficient and unbiased sequence sampling.
- Multi-granularity Importance Sampling Clipping (MISC): Building on GRPO/GSPO, applies per-token double-sided importance-sampling clipping to avoid entropy collapse and stabilize policy optimization. For each token, the importance ratio

$$r_t = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta_\text{old}}(y_t \mid y_{<t})}$$

is computed, and out-of-bounds ratios are masked.
- Well-learned Positive Sample Mask (WPSM): Tracks success rate and entropy; once a task is “mastered,” positive gradients are masked to focus on harder queries.
- Adaptive Hint-based RL (AHRL): Prepends partial chain-of-thought hints for sparse-reward problems, with hint length decreasing as training progresses, accelerating convergence on challenging reasoning tasks.
This strategy ensures stable and efficient optimization for unified multimodal objectives.
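To make the MISC masking step concrete, here is a sketch of per-token double-sided importance-sampling masking; the band edges `eps_low`/`eps_high` and the exact loss form are illustrative assumptions, not the published hyperparameters:

```python
import numpy as np

def misc_token_mask(logp_new, logp_old, eps_low=0.8, eps_high=1.2):
    """Per-token double-sided importance-sampling masking (MISC-style sketch).

    Tokens whose importance ratio r_t = pi_new / pi_old leaves the trust
    band [eps_low, eps_high] are masked out of the policy gradient rather
    than clipped, avoiding biased updates from extreme ratios.
    """
    ratio = np.exp(logp_new - logp_old)
    mask = (ratio >= eps_low) & (ratio <= eps_high)
    return ratio, mask

def masked_pg_loss(logp_new, logp_old, advantages):
    """Policy-gradient surrogate where out-of-bounds tokens contribute zero."""
    ratio, mask = misc_token_mask(logp_new, logp_old)
    return -np.sum(np.where(mask, ratio * advantages, 0.0)) / max(mask.sum(), 1)
```

Masking (rather than clipping to the band edge) means an out-of-bounds token exerts no pull on the policy at all, which is one way to keep gradient noise bounded when expert routing makes per-token likelihoods volatile.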
6. Benchmarks and Empirical Evaluation
ERNIE 5.0 achieves strong and balanced multimodal performance both before and after RL post-training:
| Modality / Task | Notable Benchmarks | Performance |
|---|---|---|
| Text | MMLU, BBH, HotPotQA, HumanEval+, LiveCodeBench, MMMLU, INCLUDE | Outperforms DeepSeek V3.2 and Kimi K2 (pre-train); matches/exceeds GPT-5 High, Gemini 3-Pro post-train |
| Vision/Video | ChartQA, DocVQA, OCRBench, SimpleVQA, CountBench, VideoMME, MMVU, GenEval, VBench | Comparable to Qwen-Image, Veo3; strong multimodal alignment |
| Audio | AISHELL-1, LibriSpeech, VoiceBench, MMAU, TUT2017, CochlScene, SEED-TTS | ASR WER: 0.31% (AISHELL-1), 1.16%/2.61% (LibriSpeech), competitive with whisper-based and Qwen-3-Omni |
| Ablations | Elasticity, routing | Modality-agnostic routing yields 1–2% gains; elastic yields 15% decode speedup with <0.2% loss |
Shallow sub-networks (60% depth) incur less than 10% average performance loss while halving latency.
7. Routing Structure Visualization and Insights
Visualization of expert routing across layers yields the following findings:
- Expert activation histograms: Reveal a small core set of experts universally shared, a medium set for text + audio, and extensive tails for modality-specialized experts, particularly in vision.
- Intersection-over-Union (IoU): Text–audio IoU is highest in early layers (semantic overlap), image–video IoU peaks in mid-depths (spatial-feature sharing), and overall semantic alignment increases with depth.
- Normalized routing entropy ($H/\log N_E$ for $N_E$ experts): Text routing is consistently balanced, with near-uniform expert usage. Visual and audio routing show alternating specialization and re-integration phases as depth increases.
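Both diagnostics above can be computed directly from per-layer expert-usage counts; a short sketch (the 25% top-set fraction used to define each modality's "most-used experts" is an illustrative choice):

```python
import numpy as np

def expert_iou(counts_a, counts_b, top_frac=0.25):
    """IoU between the most-used expert sets of two modalities at one layer.

    counts_a, counts_b : per-expert routing counts for each modality
    top_frac           : fraction of experts treated as the modality's core set
    """
    k = max(1, int(len(counts_a) * top_frac))
    top_a = set(np.argsort(counts_a)[-k:].tolist())
    top_b = set(np.argsort(counts_b)[-k:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

def normalized_entropy(counts):
    """Routing entropy scaled to [0, 1]; 1.0 means perfectly balanced usage."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))
```

Tracking these two quantities per layer reproduces the kind of depth profiles described above: IoU curves reveal where modalities share experts, and normalized entropy reveals where routing specializes versus re-balances.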
Best practices emerging from the analysis include the sufficiency of modality-agnostic routing (no manual partitioning required), organic emergence of shared experts for generic reasoning, elastic training as a regularizer, and the efficacy of “corrupt-and-correct” training within NFSP to mitigate error accumulation during long visual generation runs.
ERNIE 5.0 establishes that a unified autoregressive Transformer, augmented with ultra-sparse MoE, elastic “once-for-all” training, and stabilized RL, is capable of state-of-the-art performance across all modalities and tasks, while supporting adaptable deployments under diverse operational constraints (Wang et al., 4 Feb 2026).