Nemotron-4 340B LLMs: Architecture & Performance
- Nemotron-4 340B models are a family of 340-billion-parameter large language models built on a decoder-only Transformer architecture with synthesis-driven alignment and scalable deployment.
- They utilize a fully open-sourced training pipeline that blends traditional autoregressive pretraining over 9 trillion tokens with extensive synthetic data-driven alignment to reduce reliance on human feedback.
- The models deliver state-of-the-art results on diverse NLP tasks, excelling in natural language understanding, instruction following, and reward modeling benchmarks.
The Nemotron-4 340B model family comprises open-access, 340-billion-parameter LLMs developed and released by NVIDIA with a focus on synthesis-driven alignment, scalable deployment, and permissive licensing. The family includes three variants—Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward—all architected atop a uniform decoder-only Transformer backbone. These models achieve state-of-the-art performance among open-source systems across a range of natural language understanding, instruction following, and reward modeling tasks. The pipeline, infrastructure, and datasets for both training and alignment are fully open-sourced to foster research reproducibility and downstream application (NVIDIA et al., 2024).
1. Model Architecture and Configurations
All Nemotron-4-340B variants implement a standard causal Transformer architecture incorporating key design components:
- Backbone: Decoder-only Transformer with rotary position embeddings (RoPE) and squared-ReLU MLP activations.
- Self-Attention: Grouped Query Attention (GQA) for increased throughput. No bias terms. Dropout is set to zero.
- Embeddings: Input/output embeddings are untied.
- Key Hyperparameters:
| Parameter | Value |
|----------------------------------|-----------|
| Transformer layers | 96 |
| Hidden size | 18,432 |
| Attention heads | 96 |
| KV heads (GQA, per layer) | 8 |
| Sequence length | 4,096 |
| Vocabulary size | 256,000 |
| Total parameters | ~340B |
Of the ~340B parameters, 9.4B reside in embeddings, while the remaining 331.6B are in non-embedding weights. The Instruct and Reward variants maintain this core, with the Reward model appending a small linear "reward head."
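The reported split of ~9.4B embedding versus ~331.6B non-embedding parameters can be reproduced from the hyperparameters above with a back-of-the-envelope count. The FFN hidden size of 73,728 (4× the hidden size) is an assumption of this sketch, not a figure stated above; LayerNorm parameters (a few million) are ignored.

```python
# Back-of-the-envelope parameter count for Nemotron-4-340B.
# Assumption: FFN hidden size = 4 * hidden = 73728 (not stated above).
n_layers, hidden, n_heads, n_kv_heads = 96, 18432, 96, 8
vocab, ffn_hidden = 256_000, 4 * 18432
head_dim = hidden // n_heads      # 192
kv_dim = n_kv_heads * head_dim    # 1536 for K and 1536 for V

# Untied input and output embeddings.
emb = 2 * vocab * hidden

# Per-layer attention (GQA, no biases): Q and O projections are full-rank,
# while K and V are shared across query groups.
attn = 2 * hidden * hidden + 2 * hidden * kv_dim
# Per-layer MLP: up- and down-projection around the squared-ReLU activation.
mlp = 2 * hidden * ffn_hidden

non_emb = n_layers * (attn + mlp)
print(f"embeddings:    {emb / 1e9:.1f}B")              # ~9.4B
print(f"non-embedding: {non_emb / 1e9:.1f}B")          # ~331.6B
print(f"total:         {(emb + non_emb) / 1e9:.1f}B")  # ~341.0B
```

Under this assumption the count lands within rounding distance of the published 340B total, which suggests the 4× FFN ratio is at least consistent with the reported figures.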
Core Operations:
Self-attention in each layer follows the standard scaled dot-product form, $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_h}\right)V$, with the 96 query heads sharing 8 key/value heads under GQA. The feed-forward operation employs squared ReLU: $\mathrm{FFN}(x) = \left(\max(0,\, xW_1)\right)^2 W_2$.
Reward Model Head:
Projects the final hidden state of the final token into a 5-dimensional attribute vector $a$ (Helpfulness, Correctness, Coherence, Complexity, Verbosity); a weighted sum over these attributes yields a scalar reward $r = w^{\top} a$.
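The reward head can be sketched as a single linear projection plus a weighted sum. The projection weights and aggregation weights below are illustrative placeholders, not the released values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_attrs = 18432, 5
ATTRS = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Linear reward head: projects the final token's hidden state to 5 attribute scores.
W = rng.normal(scale=0.02, size=(n_attrs, hidden_size))

# Illustrative aggregation weights (the released weights are not given above).
w_agg = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

h_last = rng.normal(size=hidden_size)  # final hidden state of the final token
attrs = W @ h_last                     # 5-dimensional attribute vector a
reward = float(w_agg @ attrs)          # scalar reward r = w^T a
print(dict(zip(ATTRS, attrs.round(3))), round(reward, 3))
```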
2. Pretraining and Alignment Methodology
2.1 Pretraining
Pretraining utilizes a corpus of 9 trillion tokens: 70% English (web, news, books, scientific), 15% multilingual (53 languages), and 15% source code (43 languages). The base objective is standard autoregressive language modeling, minimizing $\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t})$. A continued-pretraining phase, conducted after 8T tokens, up-weights higher-quality and question-answering material for an additional 1T tokens with a steeper learning-rate decay.
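The mixture percentages and the autoregressive objective above can be made concrete with a short sketch; the bigram table standing in for $p_\theta$ is a toy stand-in, not the actual model.

```python
import math

# Token budget of the 9T-token corpus under the stated mixture.
total = 9e12
mix = {"english": 0.70, "multilingual": 0.15, "code": 0.15}
budget = {k: v * total for k, v in mix.items()}  # 6.3T / 1.35T / 1.35T tokens

# Toy autoregressive NLL, standing in for p_theta(x_t | x_<t):
# a hypothetical bigram table over a tiny vocabulary.
probs = {("<s>", "a"): 0.5, ("a", "b"): 0.8, ("b", "a"): 0.6}

def nll(tokens):
    """Negative log-likelihood: -sum_t log p(x_t | x_{t-1})."""
    return -sum(math.log(probs[pair]) for pair in zip(tokens, tokens[1:]))

print({k: f"{v / 1e12:.2f}T" for k, v in budget.items()})
print(round(nll(["<s>", "a", "b", "a"]), 4))  # 1.4271
```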
2.2 Synthetic Data–Driven Alignment
More than 98% of alignment data is synthesized, reducing dependence on large-scale human annotation. The pipeline, open-sourced via NeMo-Aligner, comprises:
- Prompt Preparation: Single-turn (e.g., writing, math, coding) and instruction-following prompts (explicit formats), two-turn prompts for dialog, and real prompts from LMSYS-Chat-1M.
- Synthetic Dialogue Generation: Role-play dialogues (three turns) generated by intermediate instruct models; quality filtered using Nemotron-4-340B-Reward scoring.
- Synthetic Preference Data: Triplets are constructed using ground-truth or judge models (GSM8K/MATH), LLM-as-Judge (initially), and Reward-model-as-Judge (final). This process yields ~300K synthetic preferences.
- Iterative Weak-to-Strong Alignment: Alternation between intermediate generator models (starting from Mixtral-8x7B-Instruct), fine-tuning, and data regeneration, across three improvement cycles.
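The iterative weak-to-strong loop described above can be sketched as follows. All function names, stub bodies, and the filtering threshold are hypothetical stand-ins, not the released NeMo-Aligner API.

```python
# Sketch of the iterative weak-to-strong alignment loop.
# Every function here is a hypothetical stub, not the released pipeline.

def generate_dialogues(generator, prompts):
    """Stub for synthetic dialogue generation by the current generator."""
    return [f"{generator}:{p}" for p in prompts]

def reward_score(sample):
    """Stub for Nemotron-4-340B-Reward scoring (constant for illustration)."""
    return 0.9

def finetune(base, data):
    """Stub for SFT / preference fine-tuning on the filtered data."""
    return f"{base}+sft({len(data)})"

generator = "Mixtral-8x7B-Instruct"  # initial (weaker) generator
base, prompts = "Nemotron-4-340B-Base", ["p1", "p2", "p3", "p4"]

for cycle in range(3):  # three improvement cycles
    samples = generate_dialogues(generator, prompts)
    # Keep only samples the reward model scores highly (quality filtering).
    kept = [s for s in samples if reward_score(s) > 0.5]
    aligned = finetune(base, kept)
    generator = aligned  # the next cycle generates data with the stronger model
print(generator)
```

The key property of the loop is that data quality and policy strength improve together: each cycle's aligned model becomes the generator for the next cycle's synthetic data.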
2.3 Alignment Algorithms
- Supervised Fine-Tuning (SFT):
- Code SFT (~800K synthetic code examples generated via the Genetic Instruct pipeline with LLM "fitness" validation): one epoch, batch size 128.
- General SFT (200K mixed-task samples plus 2% code): three epochs.
- Preference Fine-Tuning:
- Direct Preference Optimization (DPO): Maximizes log-likelihood gap between chosen/rejected responses (with KL penalty).
- Reward-aware Preference Optimization (RPO): Matches the model's implicit preference gap to the actual reward gap, $\mathcal{L}_{\text{RPO}} = \mathbb{D}\big[\beta\big(\log\tfrac{\pi(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \log\tfrac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\big) \,\big\|\, \eta\,(r^{*}(x, y_c) - r^{*}(x, y_l))\big]$,
where $\mathbb{D}$ is a small-variance KL-divergence-based distance. RPO, applied after DPO initialization, further improves alignment metrics.
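The two preference objectives can be sketched side by side. The squared distance in the RPO sketch stands in for the distance measure $\mathbb{D}$; that substitution, and all the numeric inputs, are assumptions of this illustration.

```python
import math

def logsigmoid(x):
    return -math.log1p(math.exp(-x))

def dpo_loss(lr_chosen, lr_rejected, beta=1.0):
    """DPO: push the policy/reference log-ratio of the chosen response
    above that of the rejected one (KL penalty is implicit via the reference)."""
    return -logsigmoid(beta * (lr_chosen - lr_rejected))

def rpo_loss(lr_chosen, lr_rejected, r_chosen, r_rejected, beta=1.0, eta=1.0):
    """RPO sketch: match the model's implicit reward gap to the actual
    reward-model gap. Squared distance stands in for the paper's D."""
    implicit_gap = beta * (lr_chosen - lr_rejected)
    reward_gap = eta * (r_chosen - r_rejected)
    return (implicit_gap - reward_gap) ** 2

print(round(dpo_loss(1.0, 0.0), 4))       # ~0.3133
print(rpo_loss(1.0, 0.0, 0.8, 0.3))       # penalizes over- or under-shooting the gap
```

Note the qualitative difference: DPO only cares about the *sign* and margin of the preference, while RPO also uses the *magnitude* of the reward gap, so weakly preferred pairs exert less pull than strongly preferred ones.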
3. Benchmarking and Empirical Performance
3.1 Hardware and Optimizations
- Inference fits on a single NVIDIA DGX H100 node (8 × H100 SXM5 80GB GPUs) using FP8 tensor-core precision.
- During pretraining: 8-way tensor parallelism, 12-way pipeline parallelism, and data parallelism (degree scaled from 16 to 64), achieving ~41% Model FLOPs Utilization (MFU).
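The deployment figures above imply some simple arithmetic worth spelling out; the headroom left for KV cache and activations is an inference from the numbers, not a stated figure.

```python
# FP8 inference: roughly 1 byte per parameter for the weights.
params_gb = 340                       # ~340B params -> ~340 GB of FP8 weights
dgx_h100_mem_gb = 8 * 80              # 8 x H100 SXM5 80GB = 640 GB
assert params_gb < dgx_h100_mem_gb    # fits, with headroom for KV cache/activations

# Pretraining world size: tensor x pipeline x data parallelism.
tp, pp = 8, 12
for dp in (16, 32, 64):
    print(f"DP={dp}: {tp * pp * dp} GPUs")  # 1536, 3072, 6144
```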
3.2 Evaluation Results
Main Benchmarks:
| Model | Task | Score |
|---|---|---|
| Nemotron-4-340B-Base | ARC-Challenge | 94.28 |
| | Winogrande | 89.50 |
| | HellaSwag | 90.53 |
| | MMLU | 81.10 |
| | BBH | 85.44 |
| | HumanEval | 57.32 |
| Nemotron-4-340B-Instruct | Arena Hard | 54.2% |
| | AlpacaEval | 41.5% |
| | MT-Bench | 8.22/10 |
| | GSM8K | 92.3% |
| | HumanEval | 73.2% |
Reward Modeling (RewardBench subcategories):
| Model | Overall | Chat | Chat-Hard | Safety | Reason. |
|---|---|---|---|---|---|
| Nemotron-4-340B-Reward | 92.0 | 95.8 | 87.1 | 91.5 | 93.7 |
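The Overall figure is consistent with an unweighted mean of the four category scores, which is assumed here to be RewardBench's aggregation rule:

```python
# Check that Overall matches the unweighted mean of the four categories.
scores = {"Chat": 95.8, "Chat-Hard": 87.1, "Safety": 91.5, "Reasoning": 93.7}
overall = sum(scores.values()) / len(scores)
print(round(overall, 1))  # 92.0
```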
Human Evaluation:
Against GPT-4-1106-preview, the overall Win/Tie/Loss rate is 28.2% / 46.6% / 25.2% across 10 task categories; annotators rated response length as optimal 79.4% of the time for Nemotron-4-340B-Instruct vs. 74.0% for GPT-4.
Safety and Red-Teaming:
On the AEGIS safety benchmark, the Instruct model shows a very low unsafe-response rate (comparable to Llama-3-70B-Instruct); Garak vulnerability-scanning results range from nominal to good, with minor weaknesses on adversarial-hallucination and malware probes.
4. Synthetic Data Generation and Pipeline
The Nemotron alignment workflow is highly dependent on synthetic data, which accounts for more than 98% of alignment supervision. The fully open-sourced pipeline comprises:
- Prompt generation scripts (math, coding, dialog, format-specific, and real-world-distributed prompts)
- Role-playing conversational simulators for synthetic multi-turn data
- Quality and preference filtering components (including reward-model judging)
This infrastructure enables "iterative weak-to-strong alignment," a process by which intermediate models are cyclically improved as data quality and policy improve in tandem.
A plausible implication is that this high-quality synthetic generation and judging pipeline reduces the reliance on expensive human feedback, enabling scalable RLHF for both alignment and reward modeling.
5. Licensing, Release, and Community Impact
All models and supporting infrastructure are released under the NVIDIA Open Model License Agreement. The license permits commercial and research usage, redistribution, and creation of derivative works, requiring only attribution to NVIDIA.
Open-sourced components include:
- Pretraining code (Megatron-LM)
- Alignment/reward model training code (NeMo-Aligner)
- Synthetic data pipelines (including prompts, role-playing, filtering scripts)
Such transparency and permissiveness are designed to foster rapid, responsible innovation in large-scale language modeling. The comprehensive synthetic data pipeline is positioned to support research in model alignment, reward learning, and the generation of high-quality supervised datasets for training smaller or domain-specialized LLMs. Integrating Nemotron-4-340B-Reward into RLHF and data-filtering workflows is explicitly encouraged (NVIDIA et al., 2024).