TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

Published 31 Oct 2025 in cs.LG and cs.AI | (2510.27527v1)

Abstract: LLMs training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.


Summary

The paper presents TetraJet-v2, an NVFP4-based fully quantized training method that combines unbiased double-block quantization with dedicated remedies for weight oscillation and outliers.
It introduces OsciReset to suppress weight oscillation and OutControl to retain accuracy on outliers during low-precision training.
Experimental results demonstrate improved training and validation performance, reducing the gap with full-precision methods for cost-effective LLM training.

Summary of "TetraJet-v2: Accurate NVFP4 Training for LLMs with Oscillation Suppression and Outlier Control"

Introduction

The paper presents TetraJet-v2, an end-to-end approach to training LLMs with fully quantized 4-bit computation. It uses NVFP4 for activations, weights, and gradients in all linear layers, keeping computation efficient while mitigating two common problems: weight oscillation and poorly represented outliers. The need for low-precision training stems from the prohibitive cost of training large models, which can exceed hundreds of millions of dollars. NVFP4 has an advantage over earlier 4-bit formats because of its fine-grained block-wise scaling, which reduces quantization error for tensors containing outliers.

Challenges in Low-Precision Training

Two main challenges are identified: weight oscillation and outlier features. Weight oscillation occurs when quantized weights keep flipping between adjacent quantization bins even though the underlying high-precision weights barely change, which hurts model performance. Outlier features are activation channels with large magnitudes that cannot be represented accurately at low precision. TetraJet-v2 addresses these issues with unbiased double-block quantization, OsciReset to suppress weight oscillation, and OutControl to retain outlier accuracy.

Figure 1: The distribution of latent weight $w/s$ in OLMo2-150M blocks.11.att_proj in NVFP4 training without oscillation suppression.

TetraJet-v2's NVFP4 Linear Layer Design

TetraJet-v2 employs an unbiased double-block quantization method to convert high-precision matrices to NVFP4. Each matrix is divided into small groups of elements that share a scaling factor, with a second level of scaling applied over larger outer blocks. This keeps representation error low while adding little runtime overhead.
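
The two-level scaling idea can be sketched in NumPy as follows. This is a simulated ("fake-quantized") illustration under stated assumptions, not the paper's kernel: the inner block size of 16 follows the NVFP4 format, while the outer block size of 128, the float64 scale storage, and the names `round_to_grid` and `double_block_quantize` are hypothetical choices made for this example. It uses round-to-nearest only; the unbiased variant is a separate concern.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x):
    """Round each element of x to the nearest signed E2M1 value."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def double_block_quantize(x, inner=16, outer=128):
    """Simulate two-level block scaling on a 1-D array.

    Assumes len(x) is a multiple of `outer` and `outer` is a multiple of `inner`.
    Returns (dequantized values, 4-bit codes as floats on the E2M1 grid).
    """
    blocks = x.reshape(-1, outer // inner, inner)
    outer_amax = np.abs(blocks).max(axis=(1, 2), keepdims=True)
    inner_amax = np.abs(blocks).max(axis=2, keepdims=True)
    outer_amax = np.where(outer_amax == 0, 1.0, outer_amax)  # avoid 0/0 on all-zero blocks
    # Outer scale maps the outer block's largest magnitude onto the grid maximum (6).
    outer_scale = outer_amax / 6.0
    # Inner scale is relative to the outer block, so it lies in (0, 1].
    # Real hardware would store it in a low-precision format; here it stays float64.
    inner_scale = np.where(inner_amax == 0, 1.0, inner_amax / outer_amax)
    scale = outer_scale * inner_scale
    codes = round_to_grid(blocks / scale)
    return (codes * scale).reshape(-1), codes.reshape(-1)
```

Because each 16-element group is rescaled by its own maximum, a single outlier only coarsens the resolution of its own group rather than the whole tensor.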

The NVFP4 linear layer quantizes its inputs with unbiased stochastic rounding so that the estimated gradients agree in expectation with those of the high-precision model. Combined with finer outer-block scaling, this improves training stability compared to previous methods.
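
To illustrate what "unbiased" means here, the following minimal sketch rounds a pre-scaled value stochastically to one of its two neighbouring FP4 E2M1 grid points, with probabilities chosen so that the expected result equals the input. The function name `stochastic_round` and the assumption that inputs have already been divided by their block scale are illustrative, not taken from the paper.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round(x, rng):
    """Round pre-scaled values (|x| <= 6) to a neighbouring E2M1 value.

    The upper neighbour is chosen with probability proportional to how close
    x is to it, so the rounding error has zero mean.
    """
    mag = np.clip(np.abs(x), 0.0, 6.0)
    # Index of the first grid point >= mag, kept in [1, 7] so a lower neighbour exists.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag, side="left"), 1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(200_000, 2.5), rng)
print(np.unique(samples), samples.mean())  # only 2.0 and 3.0 appear; the mean is close to 2.5
```

Round-to-nearest would map 2.5 to the same grid point every time, which introduces a systematic error; stochastic rounding trades that bias for variance.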

Oscillation Suppression with OsciReset

OsciReset is introduced to address oscillation by resetting the master weights to the center of the quantization bin, mitigating detrimental fluctuations at low learning rates. The methodology identifies oscillating weights and applies adjustments to stabilize them without degrading global optimization performance.
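
A minimal sketch of the reset step is given below. The detection rule used here (counting how often a weight's quantized value flips over a window of steps, compared against a threshold) is a hypothetical stand-in, since this summary does not spell out the paper's criterion. The names `count_flips` and `osci_reset` are likewise illustrative, weights are assumed to be already divided by their scale (the latent weight $w/s$), and the "center of the bin" is taken to be the grid value itself.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x):
    """Round each element of x to the nearest signed E2M1 value."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def count_flips(master_history):
    """Count how often each weight's quantized value changes.

    master_history has shape (steps, num_weights) and holds latent weights w/s.
    """
    q = round_to_grid(master_history)
    return (q[1:] != q[:-1]).sum(axis=0)

def osci_reset(master, flips, threshold):
    """Move weights flagged as oscillating onto their current grid value.

    `threshold` is a hypothetical hyperparameter; other weights are untouched.
    """
    oscillating = flips >= threshold
    return np.where(oscillating, round_to_grid(master), master)

# Weight 0 hovers around the 0.5/1.0 decision boundary (0.75); weight 1 is stable.
history = np.array([[0.74, 2.10],
                    [0.76, 2.12],
                    [0.74, 2.11],
                    [0.76, 2.13]])
flips = count_flips(history)
new_master = osci_reset(history[-1], flips, threshold=2)
```

After the reset, the first weight sits at a grid value, well away from the decision boundary, so small updates no longer flip its quantized value; the second weight keeps its high-precision value.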

Figure 2: The change of the oscillation proportion with/without OsciReset on OLMo2-150M.

Outlier Control with OutControl

OutControl uses random Hadamard transformations to manage outliers during backpropagation, maintaining gradient accuracy. Additionally, outlier channels with larger variance are statically selected for higher precision retention, enhancing both forward and backward computations.
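
The reason a random Hadamard transformation helps can be seen in a short NumPy sketch. The transform is orthogonal, so applying it along the shared inner dimension of a matrix product leaves the exact result unchanged, while spreading a single large channel across all channels before quantization. The helper names, shapes, and the Sylvester construction are illustrative assumptions; quantization itself is omitted, and the static selection of high-precision outlier channels is not shown.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard(n, rng):
    """Orthogonal matrix diag(signs) @ H / sqrt(n) with random +/-1 signs."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n) / np.sqrt(n)

rng = np.random.default_rng(0)
g = 0.1 * rng.normal(size=(4, 16))   # stand-in for a gradient matrix
g[:, 3] = 100.0                      # one outlier channel
w = rng.normal(size=(16, 8))

R = random_hadamard(16, rng)
g_rot, w_rot = g @ R, R.T @ w        # rotate both operands along the inner dimension

print(np.allclose(g @ w, g_rot @ w_rot))      # the product is preserved
print(np.abs(g).max(), np.abs(g_rot).max())   # the peak magnitude shrinks after rotation
```

With a smaller peak magnitude, the block scaling factors shrink too, so the remaining entries are quantized on a finer grid.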

Figure 3: Activation magnitudes of MLP input at layer 10 for different GSM8K samples across OLMo2-370M training checkpoints at different steps.

Experimental Evaluation

TetraJet-v2 was evaluated on LLM pre-training across model sizes up to 370M parameters and data sizes up to 200B tokens, consistently outperforming prior FP4 training methods. On average, it reduces the performance gap to full-precision training by 51.3%. For example, the full configuration, TetraJet-v2-full, improved both training and validation perplexity across multiple datasets and model settings.

Figure 4: Validation loss curve of OLMo2-370M with about 200B tokens for comparing different methods.

Conclusions and Future Directions

TetraJet-v2 offers a viable path to efficient and accurate low-precision training for LLMs, reducing the computational cost of large-scale models. Future work should extend these methods to larger models and token budgets and implement them on practical low-precision hardware to fully exploit the NVFP4 format. Mixed-precision formats such as $\mathtt{FP6\times FP4}$ could offer further improvements and merit exploration.