Nemotron-4 340B LLMs: Architecture & Performance
- Nemotron-4 340B models are a family of 340-billion-parameter large language models built on a decoder-only Transformer architecture with synthesis-driven alignment and scalable deployment.
- They utilize a fully open-sourced training pipeline that blends traditional autoregressive pretraining over 9 trillion tokens with extensive synthetic data-driven alignment to reduce reliance on human feedback.
- The models deliver state-of-the-art results on diverse NLP tasks, excelling in natural language understanding, instruction following, and reward modeling benchmarks.
The Nemotron-4 340B model family comprises open-access, 340-billion-parameter LLMs developed and released by NVIDIA with a focus on synthesis-driven alignment, scalable deployment, and permissive licensing. The family includes three variants—Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward—all architected atop a uniform decoder-only Transformer backbone. These models achieve state-of-the-art performance among open-source systems across a range of natural language understanding, instruction following, and reward modeling tasks. The pipeline, infrastructure, and datasets for both training and alignment are fully open-sourced to foster research reproducibility and downstream application (NVIDIA et al., 2024).
1. Model Architecture and Configurations
All Nemotron-4-340B variants implement a standard causal Transformer architecture incorporating key design components:
- Backbone: Decoder-only Transformer with rotary position embeddings (RoPE) and squared-ReLU MLP activations.
- Self-Attention: Grouped Query Attention (GQA) for increased throughput. No bias terms. Dropout is set to zero.
- Embeddings: Input/output embeddings are untied.
- Key Hyperparameters:
| Parameter | Value |
|----------------------------------|-----------|
| Transformer layers | 96 |
| Hidden size | 18,432 |
| Attention heads | 96 |
| KV heads (GQA, per layer) | 8 |
| Sequence length | 4,096 |
| Vocabulary size | 256,000 |
| Total parameters | ~340B |
Of the ~340B parameters, 9.4B reside in embeddings, while the remaining 331.6B are in non-embedding weights. The Instruct and Reward variants maintain this core, with the Reward model appending a small linear "reward head."
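The reported split of ~9.4B embedding versus ~331.6B non-embedding parameters can be reproduced from the hyperparameters above with a back-of-the-envelope count. The FFN hidden size of 73,728 (4× the hidden size) is an assumption of this sketch, not a figure stated above; LayerNorm parameters (a few million) are ignored.

```python
# Back-of-the-envelope parameter count for Nemotron-4-340B.
# Assumption: FFN hidden size = 4 * hidden = 73728 (not stated above).
n_layers, hidden, n_heads, n_kv_heads = 96, 18432, 96, 8
vocab, ffn_hidden = 256_000, 4 * 18432
head_dim = hidden // n_heads      # 192
kv_dim = n_kv_heads * head_dim    # 1536 for K and 1536 for V

# Untied input and output embeddings.
emb = 2 * vocab * hidden

# Per-layer attention (GQA, no biases): Q and O projections are full-rank,
# while K and V are shared across query groups.
attn = 2 * hidden * hidden + 2 * hidden * kv_dim
# Per-layer MLP: up- and down-projection around the squared-ReLU activation.
mlp = 2 * hidden * ffn_hidden

non_emb = n_layers * (attn + mlp)
print(f"embeddings:    {emb / 1e9:.1f}B")              # ~9.4B
print(f"non-embedding: {non_emb / 1e9:.1f}B")          # ~331.6B
print(f"total:         {(emb + non_emb) / 1e9:.1f}B")  # ~341.0B
```

Under this assumption the count lands within rounding distance of the published 340B total, which suggests the 4× FFN ratio is at least consistent with the reported figures.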
Core Operations:
Self-attention in each layer follows the standard scaled dot-product form, $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_h}\right)V$, with the 96 query heads sharing 8 key/value heads under GQA. The feed-forward operation employs squared ReLU: $\mathrm{FFN}(x) = \left(\max(0,\, xW_1)\right)^2 W_2$.
Reward Model Head:
Projects the final hidden state of the final token into a 5-dimensional attribute vector $a$ (Helpfulness, Correctness, Coherence, Complexity, Verbosity); a weighted sum over these attributes yields a scalar reward $r = w^{\top} a$.
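The reward head can be sketched as a single linear projection plus a weighted sum. The projection weights and aggregation weights below are illustrative placeholders, not the released values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_attrs = 18432, 5
ATTRS = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Linear reward head: projects the final token's hidden state to 5 attribute scores.
W = rng.normal(scale=0.02, size=(n_attrs, hidden_size))

# Illustrative aggregation weights (the released weights are not given above).
w_agg = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

h_last = rng.normal(size=hidden_size)  # final hidden state of the final token
attrs = W @ h_last                     # 5-dimensional attribute vector a
reward = float(w_agg @ attrs)          # scalar reward r = w^T a
print(dict(zip(ATTRS, attrs.round(3))), round(reward, 3))
```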
2. Pretraining and Alignment Methodology
2.1 Pretraining
Pretraining utilizes a corpus of 9 trillion tokens: 70% English (web, news, books, scientific), 15% multilingual (53 languages), and 15% source code (43 languages). The base objective is standard autoregressive language modeling, minimizing $\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t})$. A continued-pretraining phase, conducted after 8T tokens, up-weights higher-quality and question-answering material for an additional 1T tokens with a steeper learning-rate decay.
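The mixture percentages and the autoregressive objective above can be made concrete with a short sketch; the bigram table standing in for $p_\theta$ is a toy stand-in, not the actual model.

```python
import math

# Token budget of the 9T-token corpus under the stated mixture.
total = 9e12
mix = {"english": 0.70, "multilingual": 0.15, "code": 0.15}
budget = {k: v * total for k, v in mix.items()}  # 6.3T / 1.35T / 1.35T tokens

# Toy autoregressive NLL, standing in for p_theta(x_t | x_<t):
# a hypothetical bigram table over a tiny vocabulary.
probs = {("<s>", "a"): 0.5, ("a", "b"): 0.8, ("b", "a"): 0.6}

def nll(tokens):
    """Negative log-likelihood: -sum_t log p(x_t | x_{t-1})."""
    return -sum(math.log(probs[pair]) for pair in zip(tokens, tokens[1:]))

print({k: f"{v / 1e12:.2f}T" for k, v in budget.items()})
print(round(nll(["<s>", "a", "b", "a"]), 4))  # 1.4271
```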
2.2 Synthetic Data–Driven Alignment
More than 98% of alignment data is synthesized, reducing dependence on large-scale human annotation. The pipeline, open-sourced via NeMo-Aligner, comprises:
- Prompt Preparation: Single-turn (e.g., writing, math, coding) and instruction-following prompts (explicit formats), two-turn prompts for dialog, and real prompts from LMSYS-Chat-1M.
- Synthetic Dialogue Generation: Role-play dialogues (three turns) generated by intermediate instruct models; quality filtered using Nemotron-4-340B-Reward scoring.
- Synthetic Preference Data: Triplets are constructed using ground-truth or judge models (GSM8K/MATH), LLM-as-Judge (initially), and Reward-model-as-Judge (final). This process yields ~300K synthetic preferences.
- Iterative Weak-to-Strong Alignment: Alternation between intermediate generator models (starting from Mixtral-8x7B-Instruct), fine-tuning, and data regeneration, across three improvement cycles.
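The iterative weak-to-strong loop described above can be sketched as follows. All function names, stub bodies, and the filtering threshold are hypothetical stand-ins, not the released NeMo-Aligner API.

```python
# Sketch of the iterative weak-to-strong alignment loop.
# Every function here is a hypothetical stub, not the released pipeline.

def generate_dialogues(generator, prompts):
    """Stub for synthetic dialogue generation by the current generator."""
    return [f"{generator}:{p}" for p in prompts]

def reward_score(sample):
    """Stub for Nemotron-4-340B-Reward scoring (constant for illustration)."""
    return 0.9

def finetune(base, data):
    """Stub for SFT / preference fine-tuning on the filtered data."""
    return f"{base}+sft({len(data)})"

generator = "Mixtral-8x7B-Instruct"  # initial (weaker) generator
base, prompts = "Nemotron-4-340B-Base", ["p1", "p2", "p3", "p4"]

for cycle in range(3):  # three improvement cycles
    samples = generate_dialogues(generator, prompts)
    # Keep only samples the reward model scores highly (quality filtering).
    kept = [s for s in samples if reward_score(s) > 0.5]
    aligned = finetune(base, kept)
    generator = aligned  # the next cycle generates data with the stronger model
print(generator)
```

The key property of the loop is that data quality and policy strength improve together: each cycle's aligned model becomes the generator for the next cycle's synthetic data.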
2.3 Alignment Algorithms
- Supervised Fine-Tuning (SFT):
- Code SFT (~800K synthetic code examples generated via the Genetic Instruct pipeline with LLM "fitness" validation): one epoch, batch size 128.
- General SFT (200K mixed-task samples plus 2% code): three epochs.
- Preference Fine-Tuning:
- Direct Preference Optimization (DPO): Maximizes log-likelihood gap between chosen/rejected responses (with KL penalty).
- Reward-aware Preference Optimization (RPO): Matches the model's implicit preference gap to the actual reward gap, $\mathcal{L}_{\text{RPO}} = \mathbb{D}\big[\beta\big(\log\tfrac{\pi(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \log\tfrac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\big) \,\big\|\, \eta\,(r^{*}(x, y_c) - r^{*}(x, y_l))\big]$,
where $\mathbb{D}$ is a small-variance KL-divergence-based distance. RPO, applied after DPO initialization, further improves alignment metrics.
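The two preference objectives can be sketched side by side. The squared distance in the RPO sketch stands in for the distance measure $\mathbb{D}$; that substitution, and all the numeric inputs, are assumptions of this illustration.

```python
import math

def logsigmoid(x):
    return -math.log1p(math.exp(-x))

def dpo_loss(lr_chosen, lr_rejected, beta=1.0):
    """DPO: push the policy/reference log-ratio of the chosen response
    above that of the rejected one (KL penalty is implicit via the reference)."""
    return -logsigmoid(beta * (lr_chosen - lr_rejected))

def rpo_loss(lr_chosen, lr_rejected, r_chosen, r_rejected, beta=1.0, eta=1.0):
    """RPO sketch: match the model's implicit reward gap to the actual
    reward-model gap. Squared distance stands in for the paper's D."""
    implicit_gap = beta * (lr_chosen - lr_rejected)
    reward_gap = eta * (r_chosen - r_rejected)
    return (implicit_gap - reward_gap) ** 2

print(round(dpo_loss(1.0, 0.0), 4))       # ~0.3133
print(rpo_loss(1.0, 0.0, 0.8, 0.3))       # penalizes over- or under-shooting the gap
```

Note the qualitative difference: DPO only cares about the *sign* and margin of the preference, while RPO also uses the *magnitude* of the reward gap, so weakly preferred pairs exert less pull than strongly preferred ones.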
3. Benchmarking and Empirical Performance
3.1 Hardware and Optimizations
- Inference fits on a single NVIDIA DGX H100 node (8 × H100 SXM5 80GB GPUs) using FP8 tensor-core precision.
- During pretraining: 8-way tensor parallelism, 12-way pipeline parallelism, and data parallelism (degree scaled from 16 to 64), achieving ~41% Model FLOPs Utilization (MFU).
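The deployment figures above imply some simple arithmetic worth spelling out; the headroom left for KV cache and activations is an inference from the numbers, not a stated figure.

```python
# FP8 inference: roughly 1 byte per parameter for the weights.
params_gb = 340                       # ~340B params -> ~340 GB of FP8 weights
dgx_h100_mem_gb = 8 * 80              # 8 x H100 SXM5 80GB = 640 GB
assert params_gb < dgx_h100_mem_gb    # fits, with headroom for KV cache/activations

# Pretraining world size: tensor x pipeline x data parallelism.
tp, pp = 8, 12
for dp in (16, 32, 64):
    print(f"DP={dp}: {tp * pp * dp} GPUs")  # 1536, 3072, 6144
```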
3.2 Evaluation Results
Main Benchmarks:
| Model | Task | Score |
|---|---|---|
| Nemotron-4-340B-Base | ARC-Challenge | 94.28 |
| | Winogrande | 89.50 |
| | HellaSwag | 90.53 |
| | MMLU | 81.10 |
| | BBH | 85.44 |
| | HumanEval | 57.32 |
| Nemotron-4-340B-Instruct | Arena Hard | 54.2% |
| | AlpacaEval | 41.5% |
| | MT-Bench | 8.22/10 |
| | GSM8K | 92.3% |
| | HumanEval | 73.2% |
Reward Modeling (RewardBench subcategories):
| Model | Overall | Chat | Chat-Hard | Safety | Reason. |
|---|---|---|---|---|---|
| Nemotron-4-340B-Reward | 92.0 | 95.8 | 87.1 | 91.5 | 93.7 |
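The Overall figure is consistent with an unweighted mean of the four category scores, which is assumed here to be RewardBench's aggregation rule:

```python
# Check that Overall matches the unweighted mean of the four categories.
scores = {"Chat": 95.8, "Chat-Hard": 87.1, "Safety": 91.5, "Reasoning": 93.7}
overall = sum(scores.values()) / len(scores)
print(round(overall, 1))  # 92.0
```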
Human Evaluation:
Against GPT-4-1106-preview, the overall Win/Tie/Loss rate is 28.2% / 46.6% / 25.2% across 10 task categories; annotators rated response length as optimal 79.4% of the time for Nemotron-4-340B-Instruct vs. 74.0% for GPT-4.
Safety and Red-Teaming:
On the AEGIS safety benchmark, the Instruct model shows a very low unsafe-response rate (comparable to Llama-3-70B-Instruct); Garak vulnerability-scanning results range from nominal to good, with minor weaknesses on adversarial-hallucination and malware probes.
4. Synthetic Data Generation and Pipeline
The Nemotron alignment workflow is highly dependent on synthetic data, which accounts for more than 98% of alignment supervision. The fully open-sourced pipeline comprises:
- Prompt generation scripts (math, coding, dialog, format-specific, and real-world-distributed prompts)
- Role-playing conversational simulators for synthetic multi-turn data
- Quality and preference filtering components (including reward-model judging)
This infrastructure enables "iterative weak-to-strong alignment," a process by which intermediate models are cyclically improved as data quality and policy improve in tandem.
A plausible implication is that this high-quality synthetic generation and judging pipeline reduces the reliance on expensive human feedback, enabling scalable RLHF for both alignment and reward modeling.
5. Licensing, Release, and Community Impact
All models and supporting infrastructure are released under the NVIDIA Open Model License Agreement. The license permits commercial and research usage, redistribution, and creation of derivative works, requiring only attribution to NVIDIA.
Open-sourced components include:
- Pretraining code (Megatron-LM)
- Alignment/reward model training code (NeMo-Aligner)
- Synthetic data pipelines (including prompts, role-playing, filtering scripts)
Such transparency and permissiveness are designed to foster rapid, responsible innovation in large-scale language modeling. The comprehensive synthetic data pipeline is positioned to support research in model alignment, reward learning, and the generation of high-quality supervised datasets for training smaller or domain-specialized LLMs. Integrating Nemotron-4-340B-Reward into RLHF and data-filtering workflows is explicitly encouraged (NVIDIA et al., 2024).