SignRoundV2: Low-Bit LLM Quantization
- SignRoundV2 is a post-training quantization framework that enables extremely low-bit (2–5 bits) quantization of large language models with nearly full-precision accuracy.
- It introduces a gradient-informed DeltaLoss metric for layer sensitivity and a calibration-based scale search to optimize bit allocation and initialization.
- SignRoundV2 demonstrates robust, production-grade performance—recovering 95–99% of full-precision accuracy—while reducing memory footprint and inference latency.
SignRoundV2 is a post-training quantization (PTQ) framework designed to enable extremely low-bit quantization (2–5 bits) of LLMs with minimal loss in accuracy relative to full-precision baselines. It introduces two principal innovations: a first-order, gradient-informed layerwise sensitivity metric (“DeltaLoss”) that directs bit allocation, and a lightweight, calibration-based search for optimal scale initialization prior to quantization. SignRoundV2 demonstrates production-grade performance—within ~1% of full-precision—in the 4–5 bit regime and delivers strong results even with 2-bit weight quantization, advancing the state of efficient LLM deployment (Cheng et al., 4 Dec 2025).
1. Mathematical Formulation
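SignRoundV2 employs a symmetric, scale-driven quantizer for both weights and activations. For a full-precision tensor $W$, quantization bit-width $b$, and scale $s$, the quantize-dequantize operator is defined as:
$$\mathrm{qdq}(W) = s \cdot \mathrm{clip}\!\left(\left\lfloor \frac{W}{s} \right\rceil,\; q_{\min},\; q_{\max}\right)$$
where $\lfloor \cdot \rceil$ is round-to-nearest; $\mathrm{clip}$ saturates to $[q_{\min}, q_{\max}]$ with $q_{\min} = -2^{b-1}$, $q_{\max} = 2^{b-1} - 1$; and the scale $s$ is set as in Equation 2.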
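A minimal sketch of a symmetric quantize-dequantize operator of this kind, in pure Python. The function name `qdq`, the signed-integer clipping range, and the max-abs scale in the usage lines are illustrative assumptions, not the framework's actual implementation.

```python
def qdq(values, bits, scale):
    """Symmetric quantize-dequantize of a list of floats (illustrative)."""
    q_min = -(2 ** (bits - 1))       # assumed signed-integer range
    q_max = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = round(v / scale)             # round-to-nearest (ties go to even in Python)
        q = max(q_min, min(q_max, q))    # saturate to [q_min, q_max]
        out.append(q * scale)            # map back to the real-valued grid
    return out


# Usage: 4-bit quantization with an assumed max-abs scale.
weights = [0.9, -0.2, 0.05, -1.3]
scale = max(abs(w) for w in weights) / (2 ** (4 - 1) - 1)
print(qdq(weights, bits=4, scale=scale))
```

Every output value is an integer multiple of `scale`, so lowering `bits` shrinks the number of representable levels rather than changing their spacing.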
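SignRoundV2 maintains compatibility with the trainable rounding offsets introduced in SignRound V1 while focusing on initialization and bit allocation. For each weight tensor $W$, the quantization deviation is defined:
$$\Delta W = W_f - W_q$$
The sensitivity of each layer $l$ is estimated via a first-order Taylor expansion as:
$$\Delta L \approx \sum g_{a_q} \odot (A_f - A_q) + \sum g_{w_q} \odot (W_f - W_q)$$
where $g_{a_q}$, $g_{w_q}$ are the loss gradients with respect to the quantized activations and weights, $A_f$, $A_q$ are full-precision and quantized activations, and $W_f$, $W_q$ are full-precision and quantized weights (Equation 3). In practice, activation errors dominate, leading to the DeltaLoss sensitivity metric:
$$\Delta L \approx \sum \left| g_{a_q} \odot (A_f - A_q) \right|$$
as in Equation 4.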
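The activation-only metric can be illustrated with a short pure-Python sketch. It assumes the metric sums the elementwise absolute products of the gradient and the activation error; the function name `delta_loss` and the flat-list inputs are hypothetical simplifications of what would be tensors in practice.

```python
def delta_loss(grad_act_q, act_fp, act_q):
    """First-order estimate of the loss change caused by activation error.

    grad_act_q: loss gradient w.r.t. the quantized activations
    act_fp:     full-precision activations
    act_q:      quantized activations
    (all given as flat lists of equal length; an assumed layout)
    """
    return sum(abs(g * (af - aq)) for g, af, aq in zip(grad_act_q, act_fp, act_q))


# Usage: a layer whose quantized activations drift further gets a larger score.
grads = [1.0, -2.0]
full_precision = [0.5, 0.5]
print(delta_loss(grads, full_precision, [0.25, 1.0]))
print(delta_loss(grads, full_precision, [0.5, 0.5]))
```

A score of zero means the quantized layer reproduces the full-precision activations exactly, so larger scores mark layers that would benefit most from extra bits.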
2. Algorithmic Components
2.1 Layer-wise Bit Allocation
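SignRoundV2 formulates a 0–1 integer program to minimize aggregate layerwise DeltaLoss under an average bit-width constraint. Given $n$ layers, allowed bit-widths $\mathcal{B}$, and a global average-bit target $T$, the optimization is:
$$\min_{x} \sum_{i=1}^{n} \sum_{b \in \mathcal{B}} \Delta L_{i,b}\, x_{i,b}$$
subject to
$$\sum_{b \in \mathcal{B}} x_{i,b} = 1 \;\; \forall i, \qquad \sum_{i=1}^{n} \sum_{b \in \mathcal{B}} b\, P_i\, x_{i,b} \le T \sum_{i=1}^{n} P_i, \qquad x_{i,b} \in \{0, 1\}$$
where $x_{i,b} = 1$ if layer $i$ is assigned bit-width $b$, $\Delta L_{i,b}$ is the DeltaLoss of layer $i$ at bit-width $b$, and $P_i$ is the parameter count in layer $i$. The program is solved efficiently with dynamic programming.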
| Step | Description |
|---|---|
| DeltaLoss computation | Compute $\Delta L_{i,b}$ for every layer $i$ and candidate bit-width $b$ |
| DP bit allocation | Optimize per-layer bits under the global average-bit constraint |
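The allocation step can be sketched as a multiple-choice knapsack solved by dynamic programming. The function name `allocate_bits`, the dictionary layout of the sensitivities, and the gcd-based budget discretization are assumptions made for illustration; the source does not specify these details.

```python
from functools import reduce
from math import gcd


def allocate_bits(delta_loss, params, bit_choices, target_avg_bits):
    """Pick one bit-width per layer to minimize total DeltaLoss.

    delta_loss[i][b]: sensitivity of layer i at bit-width b (assumed layout)
    params[i]:        parameter count of layer i
    Returns (total_loss, bits_per_layer), or None if no assignment fits.
    """
    unit = reduce(gcd, params)                 # coarsen sizes to shrink the budget axis
    sizes = [p // unit for p in params]
    budget = int(target_avg_bits * sum(sizes) + 1e-9)  # total bit-units allowed

    # best[c] = (min loss, choices) over processed layers using exactly c bit-units
    best = {0: (0.0, [])}
    for i, size in enumerate(sizes):
        nxt = {}
        for used, (loss, picks) in best.items():
            for b in bit_choices:
                cost = used + b * size
                if cost > budget:
                    continue                   # would exceed the average-bit target
                cand = loss + delta_loss[i][b]
                if cost not in nxt or cand < nxt[cost][0]:
                    nxt[cost] = (cand, picks + [b])
        best = nxt

    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Usage: two equally sized layers, 3-bit average budget.
sens = [{2: 5.0, 4: 1.0}, {2: 1.0, 4: 0.5}]
print(allocate_bits(sens, params=[100, 100], bit_choices=[2, 4], target_avg_bits=3.0))
```

In this toy case the first layer is far more sensitive at 2 bits, so the budget is spent there and the second layer stays at 2 bits.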
2.2 Lightweight Pre-Tuning Scale Search
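A calibration-driven, grid-based search finds an effective scale $s$ for each layer before any gradient-based tuning. The initialization objective is:
$$s_{\text{init}} = \arg\min_{s \in \mathcal{S}} \left\lVert \left(W - \mathrm{qdq}(W; s)\right) \odot \bar{A} \right\rVert_2^2$$
(Equation 6), where $\mathrm{qdq}(W; s)$ is the quantize-dequantize of $W$ at scale $s$ and $\bar{A}$ collects per-input-channel activation maxima from a small calibration set. Candidate scales $s \in \mathcal{S}$ are scanned, and the minimizer is chosen as $s_{\text{init}}$. This is optionally refined by a trainable scalar $\alpha$ for fine-tuning.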
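A rough sketch of such a grid search for a single weight row, in pure Python. The max-abs base scale, the multiplicative `ratios` grid, and the names `search_scale` and `qdq` are assumptions for illustration; the actual candidate set used by SignRoundV2 is not reproduced here. The sketch also assumes the row contains at least one nonzero weight.

```python
def qdq(values, bits, scale):
    """Symmetric quantize-dequantize with an assumed signed-integer range."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(q_min, min(q_max, round(v / scale))) * scale for v in values]


def search_scale(weights, act_max, bits, ratios):
    """Scan candidate scales and keep the one with the smallest
    activation-weighted squared quantization error."""
    q_max = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in weights) / q_max   # assumed starting scale
    best_scale, best_err = None, float("inf")
    for r in ratios:
        s = base * r
        deq = qdq(weights, bits, s)
        # Weight each input channel's error by its activation maximum.
        err = sum(((w - d) * a) ** 2 for w, d, a in zip(weights, deq, act_max))
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err


# Usage: channels with large activations dominate the choice of scale.
row = [1.0, -0.5, 0.25]
act = [1.0, 4.0, 0.5]
print(search_scale(row, act, bits=3, ratios=[0.6, 0.8, 1.0, 1.2]))
```

Because the error is weighted by activation maxima, the search may prefer a scale that clips a large weight if doing so represents weights on high-activation channels more precisely.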
2.3 Tuning Procedure
Each transformer block undergoes 200 sign-gradient descent steps (or up to 500 in the extended “Ours*” experiments) on a blockwise reconstruction MSE loss. Tuning uses batch size 8 and sequence length 2048, with mixed precision for higher computational throughput. To reduce the influence of outliers, the largest squared errors in each block are excluded from the loss.
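The two ingredients of this procedure, an outlier-trimmed MSE and a sign-gradient update, can be sketched in a few lines of Python. The function names, the `drop_fraction` of 0.25, and the learning rate of 0.5 are placeholder assumptions chosen for readability, not the values used by SignRoundV2.

```python
def trimmed_mse(pred, target, drop_fraction):
    """Mean squared error after discarding the largest squared errors."""
    sq = sorted((p - t) ** 2 for p, t in zip(pred, target))
    keep = len(sq) - int(len(sq) * drop_fraction)
    kept = sq[:max(keep, 1)]          # always keep at least one term
    return sum(kept) / len(kept)


def sign_sgd_step(params, grads, lr):
    """One sign-gradient descent update: each parameter moves by exactly lr
    against the sign of its gradient, regardless of gradient magnitude."""
    def sign(x):
        return (x > 0) - (x < 0)
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# Usage with illustrative values.
print(trimmed_mse([0, 0, 0, 0], [1, 1, 1, 10], drop_fraction=0.25))
print(sign_sgd_step([0.0, 0.0, 0.0], [3.0, -0.1, 0.0], lr=0.5))
```

Using only the gradient sign gives every parameter the same step size, which suits bounded quantities such as rounding offsets where the gradient magnitude is a poor guide to step length.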
3. Evaluation: Models, Benchmarks, and Results
3.1 Models and Tasks
SignRoundV2 has been evaluated on LLaMA 2 (7B, 13B, 70B), LLaMA 3 (8B, 70B), and Qwen (2.5B, 8B, 32B) models. The benchmark suite includes ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, TruthfulQA, and WinoGrande.
3.2 Quantitative Performance
For 2-bit weights (W2A16):
| Method | LLaMA2-7B | LLaMA2-13B | LLaMA2-70B |
|---|---|---|---|
| GPTQ (W2A16) | 41.6% | 48.3% | 34.4% |
| AWQ (W2A16) | 34.7% | 36.0% | 35.5% |
| OmniQ (W2A16) | 47.0% | 53.6% | 54.9% |
| SignRound V1 | 54.5% | 60.7% | 67.7% |
| SignRound V2 | 57.9% | 61.9% | 68.4% |
At an average of 4–5 bits (MXFP4/8), SignRoundV2 recovers 95–99% of full-precision accuracy.
4. Ablation Studies and Comparative Analysis
4.1 Initialization
Initialization with scale pre-tuning yields gains of 5–10 percentage points in average accuracy over “without init” baselines: Qwen3-8B improves from 48–54% to 56–66%; LLaMA3.1-8B from 48–53% to 50–60%.
4.2 DeltaLoss-Only vs. Full Tuning
The DeltaLoss-only mode (no gradient-based SignRound tuning) already surpasses heuristic methods such as “head-8bit,” “tail-8bit,” and RTN. Full SignRoundV2 further improves accuracy by approximately 1–2 percentage points due to sign-gradient rounding.
4.3 Mixed vs. Uniform Precision
In pure 2-bit (W2A16) mode, uniform-precision SignRoundV2 nearly closes the gap with mixed-precision setups. At 4–5 bits, uniform allocation with SignRoundV2 already recovers most of the full-precision accuracy, and mixed precision yields only marginal gains.
5. Implementation and Practical Considerations
5.1 Pipeline and Resource Profile
SignRoundV2 is available as open source at https://github.com/intel/auto-round, offering routines for DeltaLoss computation, dynamic bit allocation, pre-tuning, and blockwise SignRound training. Default hyperparameters include 200 steps per block, batch size 8, and 128 calibration samples. The pipeline, for one LLM instance, typically proceeds as follows:
- Load full-precision (FP) model.
- Collect activation maxima from 16–64 random calibration prompts.
- Compute DeltaLoss sensitivities (about 5–10 min per 8B model).
- Solve for per-layer bit assignment.
- Pre-tune quantization scales by grid search on Eq. 6.
- Run per-block SignRound tuning (about 2–3 h per 70B model).
- Export quantized model for inference.
5.2 Deployment and Performance
- Weight memory: relative to 16-bit weights, W2A16 mode yields an 8× reduction; MXFP4/8 modes achieve 4× (MXFP4) to 2× (MXFP8) reductions.
- Inference speed: Quantized matmul kernels (ADSO, standard INT) provide roughly 2–4× latency improvements on GPU/CPU.
- Resource constraints: in W2A16, the weights of a 70B model occupy roughly 17.5 GB (70B parameters × 2 bits) before scale metadata; the DeltaLoss pass adds a peak memory overhead on top of this.
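The weight-memory figures above follow from simple arithmetic, shown here as a small Python sketch. The helper names are hypothetical, and the estimate ignores scales, zero-points, and other per-group metadata, so real checkpoints are somewhat larger.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 1e9


def compression_ratio(baseline_bits, quant_bits):
    """Reduction factor relative to a baseline weight precision."""
    return baseline_bits / quant_bits


# Usage: a 70B-parameter model at several weight precisions.
for bits in (16, 8, 4, 2):
    print(bits, weight_memory_gb(70e9, bits), compression_ratio(16, bits))
```

The same ratios hold for any model size, since the reduction depends only on the bit-widths being compared.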
A plausible implication is that SignRoundV2 makes PTQ of large-scale LLMs practical on commodity hardware by combining first-order sensitivity estimation with a computationally inexpensive tuning pipeline.
6. Significance and Future Directions
SignRoundV2 establishes two major contributions for low-bit LLM quantization: (1) a scalable, gradient-informed sensitivity metric (DeltaLoss) that guides allocation, and (2) efficient pre-tuning that substantially improves scale initialization, both of which lead to robust quantization even in extremely low-bit regimes. Its methodology generalizes to a range of LLM architectures and could be further extended by investigation into alternative sensitivity metrics, broader calibration strategies, or adaptation to even lower resource targets (Cheng et al., 4 Dec 2025).