
Intern-S1-MO: Scientific Multimodal MoE Model

Updated 12 December 2025
  • Intern-S1-MO is a multimodal foundation model that integrates text, images, time-series, and domain-specific tokens using a 64-layer Transformer with MoE layers to excel in scientific reasoning.
  • It is pretrained on a balanced 5-trillion-token dataset spanning scientific and general domains, including specialized formats like SMILES and FASTA, enabling it to handle diverse scientific tasks.
  • Advanced reinforcement learning, expert routing, and ablation studies contribute to its efficient performance, achieving state-of-the-art results on multiple scientific benchmarks.

Intern-S1-MO is a large-scale, scientific multimodal foundation model that implements a Mixture-of-Experts (MoE) Transformer architecture, designed to excel in scientific reasoning across multiple modalities. It integrates innovations in model architecture, pretraining corpus composition, learning objectives, reinforcement learning strategies, empirical evaluations, and ablation-based insights, targeting exceptional performance in scientific tasks that have traditionally relied on expert models or underperformed with general LLMs (Bai et al., 21 Aug 2025).

1. Model Architecture

The Intern-S1-MO architecture incorporates four input modalities, all projected into a unified 12,288-dimensional embedding space:

  • Natural language: Dynamically tokenized.
  • Visual tokens: Produced by InternViT-6B.
  • Time-series: Encoded by a dedicated time-series encoder.
  • Domain-specific discrete streams: e.g., SMILES for molecules, FASTA for proteins.

These modality-specific embeddings are concatenated into a single sequence input to a 64-layer Transformer backbone based on Qwen3-235B. The architecture interleaves 32 standard dense Transformer layers with 32 MoE layers.

Each MoE layer uses 128 experts, where each expert is a two-layer feed-forward network with hidden state sizes [12,288 → 49,152 → 12,288]. Only the top-2 experts per token are activated per MoE layer, yielding approximately 28 billion activated parameters at inference but a total capacity of 241 billion parameters.

The gating function for MoE layers computes scores $g(x) = W_g x + b_g$ and routes each token to the two highest-scoring experts using softmax-normalized weights. Each activated expert computes $E_i(x) = W_{2,i}\,\phi(W_{1,i} x + b_{1,i}) + b_{2,i}$, where $\phi$ is the GELU nonlinearity. This architecture is trained under 1-way Expert Parallelism (FSDP) and can expand to 8-way parallelism at inference to maximize throughput (Bai et al., 21 Aug 2025).
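The top-2 routing described above can be sketched in NumPy. This is an illustrative minimal implementation, not the production kernel: the function names (`moe_layer`, `gelu`), the per-token loop, and the small dimensions are assumptions for clarity.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def moe_layer(x, W_g, b_g, experts, k=2):
    """Route each token to its top-k experts with softmax-normalized weights.

    x:       (n_tokens, d) token embeddings
    experts: list of (W1, b1, W2, b2) two-layer FFN parameters per expert
    """
    scores = x @ W_g + b_g                       # g(x): (n_tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        s = scores[t, topk[t]]
        w = np.exp(s - s.max())
        w /= w.sum()                             # softmax over the k winners only
        for weight, idx in zip(w, topk[t]):
            W1, b1, W2, b2 = experts[idx]
            out[t] += weight * (gelu(x[t] @ W1 + b1) @ W2 + b2)
    return out
```

Only the selected experts' FFNs are evaluated per token, which is what keeps the activated parameter count far below total capacity.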

2. Pretraining Corpus and Learning Objectives

Intern-S1-MO is pretrained on a balanced 5-trillion-token dataset:

  • 2.5T scientific-domain tokens: Chemistry, materials science, physics, life sciences, earth sciences, mathematics.
  • 2.5T general tokens: Web and code-crawled text.

The corpus covers multiple formats:

  • Text: Web pages and scientific papers parsed via a hybrid OCR + VLM pipeline.
  • Image–text pairs: Scientific diagrams, charts, microscopy images, remote sensing data.
  • Molecular/protein strings: SMILES, SELFIES, FASTA using a dynamic tokenizer.
  • Time-series: Seismic, EEG, and photometric light curves.

The pretraining objective is a left-to-right autoregressive next-token prediction loss over the interleaved multimodal sequence:

$$\mathcal{L}_{\mathrm{CPT}}(\theta) = -\sum_{i=2}^{L} w_i \log p_\theta(x_i \mid x_{1:i-1}), \qquad w_i = \frac{1}{\sqrt{\ell}}, \quad \ell = \text{number of predicted tokens}$$

Visual and time-series tokens are treated as contextual information and are not masked out or predicted. InternViT is contrastively pretrained (CLIP-style) on image–text before CPT. Model computations utilize FP8 matrix kernels with dynamic scaling (Bai et al., 21 Aug 2025).
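The $1/\sqrt{\ell}$-weighted objective can be sketched in NumPy as follows. This assumes logits are already aligned with their target tokens and that `predict_mask` marks the predicted (text) positions, with visual and time-series positions zeroed out; the function name is illustrative.

```python
import numpy as np

def cpt_loss(logits, targets, predict_mask):
    """Weighted next-token loss: each predicted token is weighted 1/sqrt(l),
    where l is the number of predicted tokens in the sequence.

    logits:       (L, V) next-token logits, aligned with targets
    targets:      (L,)   target token ids
    predict_mask: (L,)   1 for predicted text tokens, 0 for context-only tokens
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    tok_logp = logp[np.arange(len(targets)), targets]   # log p(x_i | x_<i)
    l = predict_mask.sum()                              # number of predicted tokens
    w = 1.0 / np.sqrt(l)
    return -(w * tok_logp * predict_mask).sum()
```

Masked positions contribute context through the logits but add nothing to the loss, matching the treatment of visual and time-series tokens described above.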

3. InternBootCamp Reinforcement Learning

Intern-S1-MO undergoes a two-stage post-pretraining reinforcement learning regimen termed InternBootCamp:

  • Offline RL (Supervised Fine-Tuning): "Best-of-N" sampling from expert-curated instruction datasets spanning 12 disciplines (e.g., math, physics, chemistry, coding).
  • Online RL: Rollouts on over 1,000 tasks using a Mixture-of-Rewards (MoR) approach. Various reward sources $R_j(x, y)$ (e.g., correctness verifiers, preference models, rule-based checks, environment feedback) are weighted by $w_j$ with $\sum_j w_j = 1$:

$$R_{\mathrm{MoR}}(x, y) = \sum_{j=1}^{J} w_j R_j(x, y)$$
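Structurally, the mixture is just a convex combination of reward callables. A minimal sketch, in which the function name and the example reward sources are hypothetical:

```python
def mixture_of_rewards(x, y, reward_fns, weights):
    """Weighted sum of reward sources R_j(x, y); weights are assumed to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * r(x, y) for w, r in zip(weights, reward_fns))

# Hypothetical reward sources: a binary correctness verifier and a preference score.
verifier = lambda x, y: 1.0 if y == "42" else 0.0
preference = lambda x, y: 0.5
```

Each $R_j$ can come from a different source (verifier, preference model, rule-based check), so long as they are scaled to a comparable range before mixing.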

Policy optimization employs an extended OREAL objective with SFT on positives $D^+$, policy-gradient on negatives $D^-$, and entropy control via KL-based regularization (KL-Cov). The loss is:

$$\mathcal{L}(\theta) = \lambda_{\mathrm{sft}}\,\mathbb{E}_{(x,y)\in D^+}\,L_{\mathrm{sft}}(x, y; \theta) + \lambda_{\mathrm{pg}}\,\mathbb{E}_{(x,y)\in D^-}\,L_{\mathrm{pg}}(x, y; \theta) + \mathcal{L}_{\mathrm{KL\text{-}Cov}}(\theta)$$

where the policy-gradient loss is:

$$L_{\mathrm{pg}}(x, y; \theta) = -A(x, y)\sum_{t=1}^{T}\log\pi_\theta(y_t \mid y_{<t}, x)$$

and KL regularization applies only to the highest-covariance tokens:

$$\mathcal{L}_{\mathrm{KL\text{-}Cov}}(\theta) = \beta \sum_{t \in I} \mathrm{KL}\big[\pi_{\theta_{\mathrm{old}}}(\cdot \mid h_t)\,\big\|\,\pi_\theta(\cdot \mid h_t)\big]$$

where $I$ indexes the top-$k\%$ of tokens by covariance rank (Bai et al., 21 Aug 2025).
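Under the stated decomposition, the combined objective can be sketched as follows. The array shapes, the precomputed per-token KL and covariance-score inputs, and the function name are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def oreal_loss(logp_pos, adv_neg, logp_neg, kl_per_token, cov_score,
               lam_sft=1.0, lam_pg=1.0, beta=0.1, top_frac=0.5):
    """SFT on positives + policy gradient on negatives + KL-Cov regularization.

    logp_pos:     (N_pos,)      per-sample log-likelihoods on D+
    adv_neg:      (N_neg,)      advantages A(x, y) for negatives
    logp_neg:     (N_neg, T)    per-token log pi_theta(y_t | y_<t, x) on D-
    kl_per_token: (T_all,)      precomputed per-token KL[pi_old || pi_theta]
    cov_score:    (T_all,)      covariance score used to rank tokens
    """
    l_sft = -np.mean(logp_pos)                        # NLL on positives
    l_pg = -np.mean(adv_neg * logp_neg.sum(axis=-1))  # -A(x,y) * sum_t log pi
    k = max(1, int(top_frac * len(cov_score)))
    top_idx = np.argsort(cov_score)[-k:]              # top-k% tokens by covariance
    l_kl = beta * kl_per_token[top_idx].sum()         # KL only on those tokens
    return lam_sft * l_sft + lam_pg * l_pg + l_kl
```

Restricting the KL penalty to the highest-covariance tokens keeps entropy control targeted rather than uniformly flattening the policy.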

4. Empirical Evaluation and Performance

Intern-S1-MO is comprehensively evaluated on both general reasoning and scientific domain tasks, using both text-only and multimodal benchmarks. Key metrics include accuracy or pass rate on:

  • General text-only: MMLU-Pro, GPQA@Diamond, AIME2025, IFEval.
  • General multimodal: MathVista, MMMU, MathVision, MMStar.
  • Scientific text-only: SmolInstruct, ChemBench, MatBench ($R^2$, reported as classification accuracy), ProteinLMBench.
  • Scientific multimodal: SFE, PHYSICS, MicroVQA, MSEarth-MCQ, XLRS-Bench.

Table 1: Representative Results (Text-only Science Tasks)

Model            SmolInstruct   ChemBench     MatBench      ProteinLMBench
Gemini-2.5 Pro   40.4           82.8          61.7          62.9
OpenAI o3        43.9           81.6          61.6          67.7 (best)
Grok-4           47.3           83.3          67.9          66.2
Deepseek-R1      30.7           75.6          57.7          61.4
Qwen3-235B       28.7           75.8          52.1          59.8
Kimi-K2-Inst.    48.1           75.3          61.7          66.7
InternVL3-78B    19.4           61.3          49.3          61.6
Qwen2.5-VL-72B   21.0           61.6          51.5          61.0
Intern-S1        51.0 (best)    83.4 (best)   75.0 (best)   63.1

Intern-S1 achieves state-of-the-art results among open-source models on three of the four scientific tasks and is competitive with or superior to multiple closed-source models (Bai et al., 21 Aug 2025).

5. Ablation Studies and Architectural Insights

MoE Layer and Expert Scaling

Increasing the number of experts per MoE layer ($M$) from 64 to 128 to 256 increases model capacity, but empirical gains saturate at $M = 128$ when routing is fixed at $k = 2$.

Gating Design

Comparisons of gating mechanisms (softmax vs. Gumbel-softmax/top-1) found that top-2 softmax routing yields optimal stability and throughput.

MoR Weighting in RL

Curriculum-annealed weighting of the MoR reward sources, where the weights $w_j$ shift toward harder tasks over training, improves mean multitask performance by 4%.
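One way such a schedule could look, as a hedged sketch: linearly interpolating between an easy-weighted and a hard-weighted MoR profile, then renormalizing. The linear form and the function name are assumptions; the source does not specify the schedule.

```python
def annealed_weights(w_easy, w_hard, step, total_steps):
    """Shift MoR weights from an easy-task profile toward a hard-task profile.

    w_easy, w_hard: reward-source weight profiles (each summing to 1)
    step:           current training step
    """
    t = min(step / total_steps, 1.0)                        # anneal fraction in [0, 1]
    w = [(1 - t) * we + t * wh for we, wh in zip(w_easy, w_hard)]
    s = sum(w)
    return [wi / s for wi in w]                             # keep sum_j w_j = 1
```

Early in training the easy-task rewards dominate; by the end the weight mass has moved to the harder tasks, consistent with the curriculum described above.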

Efficiency

Top-2 MoE routing reduces floating-point operations by approximately $5\times$ compared to a dense 241B Transformer. The Intern-S1-mini configuration (8B activated, 100B total) achieves about 90% of full-model science benchmark performance at a third of the inference cost.

Multimodal and Tokenizer Advances

Joint contrastive pretraining and CPT on image–text data boosts chart and mathematical benchmark performance by 8–12% absolute over text-only variants. The dynamic tokenizer delivers a 70% higher compression rate (CR) on SMILES sequences, yielding 40% faster token throughput and a +5% gain on downstream chemical reasoning (Bai et al., 21 Aug 2025).

6. Significance and Outlook

Intern-S1-MO provides a scalable, generalist foundation model tailored for scientific reasoning, addressing the historical gap in open-domain, publicly-accessible models for professional scientific tasks. The integration of large-scale multimodal data, advanced MoE transformer scaling, and novel RL with mixture-of-rewards leads to competitive and, in several domains, state-of-the-art performance. The model is publicly released, facilitating future research in both foundational science AI and multimodal learning (Bai et al., 21 Aug 2025).

References (1)
