Nemotron-3 Super & Ultra Models

Updated 2 February 2026
  • Nemotron-3 models are hybrid MoE architectures that interleave Mamba-2 layers with sparse expert blocks and full self-attention for efficient long-context reasoning.
  • They employ NVFP4 low-precision quantization and advanced RL protocols, resulting in enhanced throughput and near-constant memory utilization per token.
  • Optimized for high-volume production, these models support up to 1M-token contexts and deliver superior performance in reasoning, coding, and agentic tasks.

The Nemotron-3 Super and Ultra models constitute the two largest instantiations of NVIDIA’s Nemotron-3 family, designed to deliver scalable, high-throughput agentic and reasoning capabilities suitable for large context windows and demanding production workloads. Employing hybrid architectures, low-precision quantization, and advanced reinforcement learning protocols, these models facilitate efficient serving and advanced planning over long input sequences.

1. Hybrid Mamba–Transformer and LatentMoE Architecture

Nemotron-3 Super and Ultra are architected as hybrid Mixture-of-Experts (MoE) models centered on Mamba-2 sequence-modeling layers. The layer order interleaves:

  • Mamba-2 layers, which maintain a small, fixed-size internal state and replace standard feed-forward and self-attention modules with efficient state-space recurrences.
  • Sparse expert MoE blocks, providing high nonlinear capacity per computational step.
  • Intermittent full self-attention layers with grouped-query attention (GQA), allowing global context mixing across the sequence.
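As a toy illustration of this interleaving, a repeating layer schedule might look like the following; the ratio and ordering of Mamba-2, MoE, and attention layers shown here are assumptions for the example, not the published Nemotron-3 configuration.

```python
# Hypothetical layer schedule illustrating the interleaving described above.
# The pattern (two Mamba-2/MoE pairs per attention layer) is an assumption,
# not the actual Nemotron-3 layer map.
def layer_schedule(n_blocks: int) -> list[str]:
    pattern = ["mamba2", "moe", "mamba2", "moe", "attention", "moe"]
    return [pattern[i % len(pattern)] for i in range(n_blocks)]

print(layer_schedule(12))
```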

In the LatentMoE variant employed by both Super and Ultra, input representations $x \in \mathbb{R}^d$ are projected via a learnable $U \in \mathbb{R}^{\ell \times d}$ to a lower-dimensional latent space, $z = Ux$, where $\ell \ll d$. All subsequent gating and expert feed-forward processing occurs in this reduced latent space, with the post-expert output $y_\text{lat}$ returned to the full dimension by $V \in \mathbb{R}^{d \times \ell}$: $y = x + V y_\text{lat}$.

The reduction in communication and computation by a factor of $d/\ell$ enables both greater expert diversity and increased token-level parallelism. For Super/Ultra, $\ell \approx d/4$ (e.g., $d = 4096$, $\ell = 1024$), quadrupling both the total number of experts and the number of active experts selected per token relative to standard MoEs.
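The latent projection and expert routing described above can be sketched numerically as follows. The dimensions $d = 4096$ and $\ell = 1024$ follow the text, but the expert count, top-k value, and softmax gating are illustrative assumptions, not the released configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, l = 4096, 1024          # model dim and latent dim (l ≈ d/4, per the text)
n_experts, top_k = 8, 2    # assumed values for illustration

U = rng.standard_normal((l, d)) * 0.02   # down-projection to latent space
V = rng.standard_normal((d, l)) * 0.02   # up-projection back to model dim
W_gate = rng.standard_normal((n_experts, l)) * 0.02
experts = [rng.standard_normal((l, l)) * 0.02 for _ in range(n_experts)]

def latent_moe(x: np.ndarray) -> np.ndarray:
    """One token: gate and run experts entirely in the latent space."""
    z = U @ x                                   # z = Ux, shape (l,)
    logits = W_gate @ z
    chosen = np.argsort(logits)[-top_k:]        # top-k expert selection
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                    # softmax over chosen experts
    y_lat = sum(w * (experts[e] @ z) for w, e in zip(weights, chosen))
    return x + V @ y_lat                        # residual: y = x + V y_lat

x = rng.standard_normal(d)
y = latent_moe(x)
```

Because gating, expert GEMMs, and all-to-all exchange operate on $\ell$-dimensional vectors rather than $d$-dimensional ones, the per-token communication volume shrinks by the same $d/\ell$ factor.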

Multi-Token Prediction (MTP) blocks are incorporated near the end of the network for speculative decoding: the model outputs $M$ subsequent token distributions per position via a linear head, trained to maximize the acceptance rate in batch-1, long-output scenarios.
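The verify step behind MTP speculative decoding can be sketched as a draft-and-check loop. The greedy acceptance rule below is the simplest possible variant, chosen for clarity; it is not necessarily the scheduler NVIDIA ships.

```python
import numpy as np

def speculative_step(draft_tokens, verify_logits):
    """Greedy verification: accept draft tokens until the first mismatch
    with the verifier's argmax, then emit the verifier's token and stop.
    draft_tokens:   (M,) ints proposed by the MTP head
    verify_logits:  (M, vocab) logits from one full-model forward pass
    """
    accepted = []
    for t, logits in zip(draft_tokens, verify_logits):
        best = int(np.argmax(logits))
        if t == best:
            accepted.append(t)     # draft agrees with the full model
        else:
            accepted.append(best)  # correct the draft and stop speculating
            break
    return accepted

# Toy example: drafts [3, 7, 1]; the verifier agrees on the first two only.
vocab = 10
logits = np.full((3, vocab), -1.0)
logits[0, 3] = logits[1, 7] = logits[2, 5] = 1.0
print(speculative_step([3, 7, 1], logits))  # [3, 7, 5]
```

When drafts are frequently accepted, several tokens are committed per full-model forward pass, which is what makes MTP attractive for batch-1 long-output generation.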

This configuration yields nearly constant memory utilization per token, with compute cost scaling linearly in context length up to 1M tokens.
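The near-constant memory claim can be made concrete with back-of-envelope sizing: a full-attention layer's KV cache grows linearly with context, while a state-space layer's recurrent state does not. All dimensions below are hypothetical placeholders, not Nemotron-3's actual shapes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for full-attention layers: keys + values per past token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """Mamba-style recurrent state: fixed size, independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical dimensions for illustration only.
for ctx in (8_000, 1_000_000):
    kv = kv_cache_bytes(n_layers=8, n_kv_heads=8, head_dim=128, seq_len=ctx)
    ssm = ssm_state_bytes(n_layers=56, d_inner=8192, d_state=128)
    print(f"{ctx:>9} tokens: KV {kv/1e9:.2f} GB, SSM state {ssm/1e9:.3f} GB")
```

With only a handful of attention layers in the stack, the linearly growing KV term stays small, so total per-token memory is dominated by the fixed SSM state.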

2. Training Procedures and Quantization

Both Nemotron-3 Super and Ultra are initialized via large-scale supervised pretraining. The corpus comprises over 10 trillion tokens, integrating filtered web text, curated code, mathematical sources, and extensive long-context documents. Deduplication and curation are performed using NeMo Data Designer, which supports document-level deduplication and long-horizon sequence structuring up to 512K tokens.

All major parameters are trained natively in NVFP4 quantization using the NVIDIA Transformer Engine, which provides fused 4-bit floating-point GEMMs for the forward, backward, and weight-update passes. Approximately 85% of model layers (apart from projections and certain attention modules) reside entirely in NVFP4, with key “sensitive” components retained in BF16 or MXFP8. On Blackwell Ultra hardware, NVFP4 achieves roughly 3× peak GEMM throughput versus FP8, with the empirical loss gap to BF16 remaining under 1%.
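The effect of a 4-bit floating-point grid can be simulated directly. The magnitudes below are those of a generic E2M1 (FP4) format; real NVFP4 kernels apply per-block scale factors, whereas this sketch uses a single per-tensor scale purely for illustration.

```python
import numpy as np

# Representable magnitudes of E2M1: 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w: np.ndarray) -> np.ndarray:
    """Scale into the FP4 range, snap each magnitude to the nearest grid
    point (preserving sign), then rescale back to the original range."""
    scale = np.abs(w).max() / FP4_GRID[-1]
    scaled = np.abs(w) / scale
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(w) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
wq = quantize_fp4(w)
rel_err = np.linalg.norm(w - wq) / np.linalg.norm(w)
print(f"relative quantization error: {rel_err:.3f}")
```

Finer-grained (per-block) scaling, as in actual NVFP4, reduces this error further by letting each block use the full grid.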

After pretraining and SFT, a consolidated multi-environment RLHF-style stage is conducted. Actor-learners gather trajectory rollouts concurrently from diverse environments, including mathematical chain-of-thought (GSM8K, MATH), code-generation tasks (APPS, HumanEval++), tool invocation (When2Call), long-context retrieval, and chat agents. Group Relative Policy Optimization (GRPO) with KL regularization and masked importance sampling is applied, minimizing:

LRL(θ)=Eτπold[tAtlogπθ(atst)]+λKL[πθπref]\mathcal{L}_\text{RL}(\theta) = - \mathbb{E}_{\tau\sim\pi_\text{old}} \left[\sum_t A_t \log \pi_\theta(a_t|s_t) \right] + \lambda \mathrm{KL}[\pi_\theta\,||\,\pi_\text{ref}]

Here, the advantage $A_t$ incorporates GAE with a reference-policy anchor. Empirically, joint RL yields monotonic improvements across reasoning, planning, and agentic tasks.
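The objective above can be computed on a toy batch as follows: an advantage-weighted log-likelihood term plus a KL penalty toward the reference policy. The group-normalized advantage and the simple per-sample KL estimate are common GRPO choices assumed here, not details confirmed by the source.

```python
import numpy as np

def grpo_loss(logp, logp_ref, rewards, lam=0.1):
    """logp, logp_ref: (G,) log-probs of G sampled completions under the
    current and reference policies; rewards: (G,) scalar rewards.
    Advantages are group-normalized rewards (one common GRPO variant)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    pg = -(adv * logp).mean()        # policy-gradient term: -E[A * log pi]
    kl = (logp - logp_ref).mean()    # naive per-sample KL estimate
    return pg + lam * kl

# Toy group of 4 completions with binary rewards.
logp = np.array([-2.0, -1.5, -3.0, -2.5])
logp_ref = np.array([-2.1, -1.6, -2.8, -2.4])
r = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_loss(logp, logp_ref, r))
```

Normalizing rewards within each group of rollouts from the same prompt removes the need for a learned value baseline, which is the main practical appeal of GRPO at scale.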

3. Performance, Token Throughput, and Scalability

Nemotron-3 Super sustains approximately 1.6M tokens/sec on an 8-GPU NVIDIA Blackwell Ultra cluster (batch 8, sequence length 8K in / 8K out), compared to 0.45M tokens/sec for a baseline Transformer-MoE of similar scale. Ultra achieves 2.2M tokens/sec under identical conditions, with roughly 70B active parameters out of roughly 400B total.

For high-throughput workloads (8K+16K), Super demonstrates 4× the throughput of Qwen3-70B and 1.8× that of Gemma 3-70B. Ultra runs at more than 2× Gemma 3 under matched conditions.

Context-length support is native up to 1M tokens: Mamba layers encode position implicitly, while attention layers apply RoPE only for inputs of ≤32K tokens. Sliding-window inference and paging enable scaling without rotary or positional clipping, sustaining accuracy above 66 on 1M-context benchmarks (RULER), compared to collapse below 25 for similarly sized dense hybrids.
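For reference, the rotary embedding that the attention layers apply in the short-range regime can be sketched as below. This is standard RoPE; the ≤32K cutoff described above would simply gate whether the rotation is applied, not change its form.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a (d,)-vector at position `pos`.
    Pairs (x[i], x[i + d/2]) are rotated by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

q = np.ones(8)
print(rope(q, pos=5))
```

Because the rotation is norm-preserving and relative between query and key positions, skipping it beyond a cutoff (and relying on the Mamba layers' implicit positional state instead) avoids rotary extrapolation artifacts at extreme context lengths.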

Super and Ultra achieve inference costs per 1M-token query of under 0.25 GPU-hours and approximately 0.35 GPU-hours, respectively, on Blackwell Ultra hardware. Batch-1 latencies average 40 ms/token (Super) and 50 ms/token (Ultra), with first-token latency under 100 ms end-to-end, markedly below dense baselines.

4. Quantitative Reasoning and Agentic Capability Benchmarks

After RL post-training, core benchmarks include:

  • MMLU-Pro (5-shot CoT EM): Super ≈49%, Ultra ≈52%
  • GSM8K (8-shot): Super ≈85%, Ultra ≈87%
  • HumanEval++ (3-shot coding): Super ≈65%, Ultra ≈69%
  • Code function calling (APIGen-MT): Super ≈84% pass@1, Ultra ≈88% pass@1
  • Chat agentic tasks (WildChat, 1M logs): Super wins 65% (vs. GPT-3.5), Ultra wins 74% (vs. GPT-4)
  • Tool invocation (When2Call): Super selects the correct pipeline 91% of the time, Ultra 94%

Performance curves position Ultra at or above GPT-4 in chain-of-thought tasks and long-context agentic benchmarks, with Super outperforming comparable open models (Qwen2.5, Gemma 3).

5. Practical Deployment and Resource Considerations

Nemotron-3 Super and Ultra are architected for agentic, high-volume deployments such as IT ticketing, code review, and long-document reasoning.

  • Inference cost and scalability are optimized for linear efficiency across pods of ≥64 GPUs, with minimal bandwidth overhead due to LatentMoE’s reduced communication footprint.
  • Recommended hardware comprises NVIDIA Blackwell Ultra or GB300 platforms, utilizing Transformer Engine, GPUDirect RDMA, and NVSwitch for scale-out.
  • Software stack integrates NVIDIA Triton with custom NVFP4 kernels, a fused Mamba-2 operator in the backend, and an MTP speculative-decoding scheduler; distributed RL and fine-tuning utilize open-source NeMo-RL and NeMo-Gym.

All weight checkpoints, training recipes (data designer, NVFP4 schedules, RL environments), and evaluation harnesses are slated for open release (NVIDIA et al., 24 Dec 2025).

6. Comparative and Evolutionary Context

Nemotron-3 Super and Ultra extend the architecture class introduced in Llama-Nemotron (Super: 49B, Ultra: 253B) but differ in their hybrid Mamba–Transformer backbone, LatentMoE block routing, native NVFP4 quantization, and full 1M-token context support (Bercovich et al., 2 May 2025). Whereas early Nemotron variants leveraged heterogeneous attention and FFN block mixtures, Super and Ultra pursue sparse expert parallelism and constant-memory scaling for long-context and generative planning tasks.

Both model families employ RL optimization protocols (GRPO, RPO, multi-environment RL) and open training pipelines (NeMo, Megatron-LM, vLLM), supporting enterprise deployment with commercially permissive licenses and reproducible recipes.

A plausible implication is that the shift to Mamba-State Space MoE and LatentMoE designs is likely to inform future long-context agentic models targeting efficient bandwidth, scalability, and speculatively decoded generation with minimal hardware overhead.

7. Codebases, Datasets, and Open Resources

Key resources announced for Nemotron-3 Super and Ultra include:

  • Model weights and checkpoints for both Super and Ultra.
  • Data-designer toolchain for corpus curation and formatting, including deduplication and sequence structuring.
  • NVFP4 training schedules and quantization recipes.
  • RL environment definitions and trajectory sampling infrastructure (NeMo-RL, NeMo-Gym).
  • MTP scheduling and fast-generation methods in Triton/vLLM backends.
  • Empirical evaluation scripts and harnesses for benchmark replication.

These resources are being released under open-access terms, with datasets for pre- and post-training provided where redistribution rights allow (NVIDIA et al., 24 Dec 2025).


Nemotron-3 Super and Ultra represent hybrid MoE–Mamba architecture designs with advanced RL post-training, NVFP4 throughput optimization, and native million-token context capacity, establishing a benchmark for agentic, scalable, and commercially deployable open LLMs.
