SignRoundV2: Low-Bit LLM Quantization

Updated 5 December 2025

SignRoundV2 is a post-training quantization framework that enables extremely low-bit (2–5 bits) quantization of large language models with nearly full-precision accuracy.
It introduces a gradient-informed DeltaLoss metric for layer sensitivity and a calibration-based scale search to optimize bit allocation and initialization.
SignRoundV2 demonstrates robust, production-grade performance—recovering 95–99% of full-precision accuracy—while reducing memory footprint and inference latency.

SignRoundV2 is a post-training quantization (PTQ) framework designed to enable extremely low-bit quantization (2–5 bits) of LLMs with minimal loss in accuracy relative to full-precision baselines. It introduces two principal innovations: a first-order, gradient-informed layerwise sensitivity metric (“DeltaLoss”) that directs bit allocation, and a lightweight, calibration-based search for optimal scale initialization prior to quantization. SignRoundV2 demonstrates production-grade performance—within ~1% of full-precision—in the 4–5 bit regime and delivers strong results even with 2-bit weight quantization, advancing the state of efficient LLM deployment (Cheng et al., 4 Dec 2025).

1. Mathematical Formulation

SignRoundV2 employs a symmetric, scale-driven quantizer for both weights and activations. For a full-precision tensor $x$, quantization bit-width $b$, and scale $s$, the quantize-dequantize operator is defined as:

$qdq(x; b, s) = s \cdot \mathrm{clip} \left( \left\lfloor \frac{x}{s} \right\rceil; n, m \right)$

where $\lfloor \cdot \rceil$ is round-to-nearest; $\mathrm{clip}(\cdot; n, m)$ saturates to $[n, m]$ with $n = -2^{b-1}$, $m = 2^{b-1}-1$; and $s = \frac{\max(x) - \min(x)}{2^{b-1}}$ (Equation 2).
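The operator above can be illustrated with a minimal sketch in plain Python. The function name `qdq`, the use of lists instead of tensors, and the tie-breaking rule (halves rounded up) are assumptions made for illustration; the scale is passed in explicitly rather than derived from the data.

```python
import math

def qdq(x, bits, scale):
    """Quantize-dequantize a list of floats on a signed integer grid."""
    n = -(2 ** (bits - 1))      # lower clip bound
    m = 2 ** (bits - 1) - 1     # upper clip bound
    out = []
    for v in x:
        # Round to nearest; rounding ties upward is an assumption of this sketch.
        q = math.floor(v / scale + 0.5)
        q = max(n, min(m, q))   # saturate to [n, m]
        out.append(scale * q)   # map back to the real-valued grid
    return out

x = [-1.0, -0.3, 0.0, 0.26, 0.9]
y = qdq(x, bits=2, scale=0.5)
print(y)
```

With 2 bits the integer grid is {-2, -1, 0, 1}, so the positive side saturates one step earlier than the negative side; this is why 0.9 maps to 0.5 while -1.0 is represented exactly.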

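SignRoundV2 maintains compatibility with the trainable rounding offsets introduced in SignRound V1 while focusing on initialization and bit allocation. For each full-precision weight tensor $W_f$ with quantized counterpart $W_q = qdq(W_f; b, s)$, the quantization deviation is defined:

$\Delta W = W_f - W_q$

The sensitivity of each layer $i$ is estimated via a first-order Taylor expansion as:

$\Delta \mathcal{L}_i \approx \sum g_A \circ (A_f - A_q) + \sum g_W \circ (W_f - W_q)$

where $g_A$, $g_W$ are the gradients of the loss with respect to the activations and weights, $\circ$ is the element-wise product, $A_f$, $A_q$ are full-precision and quantized activations, and $W_f$, $W_q$ are full-precision and quantized weights (Equation 3). In practice, activation errors dominate, leading to the DeltaLoss sensitivity metric:

$\Delta \mathcal{L}_i \approx \sum \left| g_A \circ (A_f - A_q) \right|$

as in Equation 4.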
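As a rough illustration of the activation-only sensitivity score, the sketch below multiplies the loss gradient by the activation deviation element-wise and sums the magnitudes. The function name `delta_loss`, the flat-list inputs, and the exact reduction (sum of absolute values, no normalization) are assumptions of this sketch rather than details confirmed by the text.

```python
def delta_loss(grad_act, act_fp, act_q):
    """First-order estimate of the loss change caused by activation error.

    grad_act : gradient of the loss w.r.t. the layer's activations
    act_fp   : full-precision activations
    act_q    : activations produced with the layer quantized
    """
    if not (len(grad_act) == len(act_fp) == len(act_q)):
        raise ValueError("inputs must have the same length")
    # Sum of |g * (A_f - A_q)| over all elements (assumed reduction).
    return sum(abs(g * (af - aq)) for g, af, aq in zip(grad_act, act_fp, act_q))

g = [0.5, -2.0, 1.0]
a_fp = [1.0, 0.25, -0.5]
a_q = [1.0, 0.5, -1.0]
score = delta_loss(g, a_fp, a_q)
print(score)
```

A layer whose quantized activations match the full-precision ones scores zero, and the score grows with both the size of the deviation and the magnitude of the gradient at that position.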

2. Algorithmic Components

2.1 Layer-wise Bit Allocation

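SignRoundV2 formulates a 0–1 integer program to minimize aggregate layerwise DeltaLoss under an average bit-width constraint. Given $N$ layers, allowed bit-widths $\mathcal{B}$, and a global average-bit target $\bar{b}$, the optimization is:

$\min_{z} \sum_{i=1}^{N} \sum_{b \in \mathcal{B}} z_{i,b} \, \Delta \mathcal{L}_i(b)$

subject to

$\sum_{b \in \mathcal{B}} z_{i,b} = 1 \;\; \forall i, \qquad z_{i,b} \in \{0, 1\}, \qquad \frac{\sum_{i=1}^{N} \sum_{b \in \mathcal{B}} z_{i,b} \, b \, P_i}{\sum_{i=1}^{N} P_i} \le \bar{b}$

where $z_{i,b} = 1$ selects bit-width $b$ for layer $i$, $\Delta \mathcal{L}_i(b)$ is the DeltaLoss of layer $i$ at $b$ bits, and $P_i$ is the parameter count in layer $i$. This is solved efficiently using dynamic programming.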

Step	Description
DeltaLoss computation	Compute the DeltaLoss sensitivity for every (layer, bit-width) pair
DP bit allocation	Optimize per-layer bits under the global constraint
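The allocation step can be sketched as a multiple-choice knapsack solved by dynamic programming over the bits spent so far. This is a minimal illustration, not the released implementation: the function name `allocate_bits`, the input layout, and the toy numbers are all hypothetical, and parameter counts are given in small integer units to keep the state space tiny.

```python
def allocate_bits(losses, params, bit_choices, target_avg_bits):
    """Pick one bit-width per layer, minimizing the summed sensitivity.

    losses[i][b] : sensitivity of layer i at bit-width b (hypothetical input)
    params[i]    : parameter count of layer i, in any integer unit
    Constraint   : sum_i params[i] * bits[i] <= target_avg_bits * sum(params)
    """
    budget = target_avg_bits * sum(params)
    # best maps "bits spent so far" -> (total loss, chosen bit-widths)
    best = {0: (0.0, [])}
    for i, p in enumerate(params):
        nxt = {}
        for spent, (loss, chosen) in best.items():
            for b in bit_choices:
                cost = spent + p * b
                if cost > budget:
                    continue  # this choice would exceed the average-bit budget
                cand = (loss + losses[i][b], chosen + [b])
                if cost not in nxt or cand[0] < nxt[cost][0]:
                    nxt[cost] = cand
        best = nxt
    if not best:
        raise ValueError("no assignment satisfies the bit budget")
    return min(best.values(), key=lambda t: t[0])

# Toy problem: three layers, the last one twice as large as the others.
losses = [{2: 5.0, 4: 1.0}, {2: 1.5, 4: 1.0}, {2: 4.0, 4: 0.5}]
params = [1, 1, 2]
total_loss, bits = allocate_bits(losses, params, (2, 4), target_avg_bits=3)
print(total_loss, bits)
```

In this toy case the large layer is kept at 2 bits because upgrading it would consume the budget needed to protect the first layer, which is far more sensitive per parameter.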

2.2 Lightweight Pre-Tuning Scale Search

A calibration-driven, grid-based search finds an effective scale $s_{\text{init}}$ for each layer before any gradient-based tuning. The initialization objective is:

$s_{\text{init}} = \arg\min_{s \in \mathcal{S}} \left\| \left( W - qdq(W; b, s) \right) \circ \bar{A} \right\|_2^2$

(Equation 6), where $\bar{A}$ collects per-input-channel activation maxima from a small calibration set. Candidate scales $s \in \mathcal{S}$ are scanned, and the minimizer is chosen as $s_{\text{init}}$. This is optionally refined by a trainable scalar for fine-tuning.
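A minimal sketch of such a grid search follows, assuming the objective is the squared weight reconstruction error weighted by per-input-channel activation maxima. The helper names (`qdq`, `weighted_error`, `search_scale`), the candidate grid, and the toy values are hypothetical; how candidate scales are generated in practice is not specified here, so they are simply passed in.

```python
import math

def qdq(x, bits, scale):
    """Quantize-dequantize on a signed grid (ties rounded up: an assumption)."""
    n, m = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [scale * max(n, min(m, math.floor(v / scale + 0.5))) for v in x]

def weighted_error(weights, act_max, bits, scale):
    """Squared weight error, weighted by per-input-channel activation maxima."""
    wq = qdq(weights, bits, scale)
    return sum(((w - q) * a) ** 2 for w, q, a in zip(weights, wq, act_max))

def search_scale(weights, act_max, bits, candidates):
    """Return the candidate scale with the smallest weighted error."""
    return min(candidates, key=lambda s: weighted_error(weights, act_max, bits, s))

weights = [0.9, -0.4, 0.1, -1.0]      # one row of a weight matrix (toy values)
act_max = [1.0, 4.0, 0.5, 1.0]        # activation maxima per input channel
candidates = [0.3, 0.4, 0.5, 0.6]     # hypothetical grid of positive scales
s_init = search_scale(weights, act_max, bits=2, candidates=candidates)
print(s_init, weighted_error(weights, act_max, 2, s_init))
```

Weighting by activation maxima makes the search prefer scales that represent weights on high-activation channels accurately, even at the cost of larger errors elsewhere.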

2.3 Tuning Procedure

Each transformer block undergoes 200 sign-gradient descent steps (or up to 500 in extended “Ours*” experiments) on a blockwise reconstruction MSE loss. Tuning uses batch size 8 and sequence length 2048, with mixed precision for higher throughput. To reduce the influence of outliers, the largest squared errors in each block are excluded from the loss.
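The two ingredients of this tuning loop, a reconstruction loss that drops the largest squared errors and a sign-gradient update, can be sketched as follows. The function names, the drop fraction of 0.25, and the learning rate of 0.005 are illustrative placeholders, not the values used by SignRoundV2.

```python
def trimmed_mse(pred, target, drop_frac):
    """Mean squared error after discarding the largest squared errors."""
    sq = sorted((p - t) ** 2 for p, t in zip(pred, target))
    keep = max(len(sq) - int(len(sq) * drop_frac), 1)  # number of errors kept
    return sum(sq[:keep]) / keep

def sign_step(params, grads, lr):
    """One sign-gradient descent update: p <- p - lr * sign(grad)."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [p - lr * sign(g) for p, g in zip(params, grads)]

# One outlier (error of 10) is excluded from the loss.
loss = trimmed_mse([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 10.0], drop_frac=0.25)

# Each parameter moves by exactly lr, regardless of gradient magnitude.
updated = sign_step([0.0, 0.0, 0.0], [3.0, -0.2, 0.0], lr=0.005)
print(loss, updated)
```

Because only the sign of the gradient is used, every parameter moves by the same step size per iteration, which bounds the total change after a fixed number of steps.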

3. Evaluation: Models, Benchmarks, and Results

3.1 Models and Tasks

SignRoundV2 has been evaluated on LLaMA 2 (7B, 13B, 70B), LLaMA 3 (8B, 70B), and Qwen (2.5B, 8B, 32B) models. The benchmark suite includes ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, TruthfulQA, and WinoGrande.

3.2 Quantitative Performance

For 2-bit weights (W2A16):

Method	LLaMA2-7B	LLaMA2-13B	LLaMA2-70B
GPTQ (W2A16)	41.6%	48.3%	34.4%
AWQ (W2A16)	34.7%	36.0%	35.5%
OmniQ (W2A16)	47.0%	53.6%	54.9%
SignRound V1	54.5%	60.7%	67.7%
SignRound V2	57.9%	61.9%	68.4%

At 4–5 bits average (MXFP4/8), SignRoundV2 recovers 95–99% of full-precision accuracy.

4. Ablation Studies and Comparative Analysis

4.1 Initialization

Initialization with scale pre-tuning yields gains of 5–10 percentage points in average accuracy over “without init” baselines: Qwen3-8B improves from 48–54% to 56–66%; LLaMA3.1-8B from 48–53% to 50–60%.

4.2 DeltaLoss-Only vs. Full Tuning

The DeltaLoss-only mode (no gradient-based SignRound tuning) already surpasses heuristic methods such as “head-8bit,” “tail-8bit,” and RTN. Full SignRoundV2 further improves accuracy by approximately 1–2 percentage points due to sign-gradient rounding.

4.3 Mixed vs. Uniform Precision

In pure 2-bit (W2A16) mode, uniform-precision SignRoundV2 nearly closes the gap with mixed-precision setups. For 4–5 bits, uniform allocation plus SignRoundV2 already achieves high recovery of full-precision accuracy; mixed precision yields only marginal gains.

5. Implementation and Practical Considerations

5.1 Pipeline and Resource Profile

SignRoundV2 is available as open source at https://github.com/intel/auto-round, offering routines for DeltaLoss computation, dynamic bit allocation, pre-tuning, and blockwise SignRound training. Default hyperparameters include 200 steps per block, batch size 8, and 128 calibration samples. The pipeline, for one LLM instance, typically proceeds as follows:

Load full-precision (FP) model.
Collect activation maxima from 16–64 random calibration prompts.
Compute DeltaLoss sensitivities (approximately 5–10 min per 8B model).
Solve for per-layer bit assignment.
Pre-tune quantization scales by grid search on Eq. 6.
Run per-block SignRound tuning (approximately 2–3 h per 70B model).
Export quantized model for inference.

5.2 Deployment and Performance

Weight memory: relative to 16-bit weights, W2A16 mode yields an 8× reduction; MXFP4/8 mode yields a 4× (MXFP4) to 2× (MXFP8) reduction.
Inference speed: Quantized matmul kernels (ADSO, standard INT) provide roughly 2–4× latency improvements on GPU/CPU.
Resource constraints: in W2A16 mode, 70B models fit within a much smaller VRAM budget, with additional peak memory overhead during DeltaLoss computation.
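The weight-memory ratios follow from simple arithmetic, sketched below. The helper `weight_gigabytes` is hypothetical and counts weight bits only; it ignores quantization scales, other metadata, activations, and the KV cache, so the outputs are back-of-envelope figures rather than measured results.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
fp16_gb = weight_gigabytes(params_70b, 16)   # 16-bit baseline
w4_gb = weight_gigabytes(params_70b, 4)      # 4-bit weights
w2_gb = weight_gigabytes(params_70b, 2)      # 2-bit weights (W2A16)

print(fp16_gb, w4_gb, w2_gb)
print(fp16_gb / w2_gb, fp16_gb / w4_gb)      # compression ratios vs. 16-bit
```

Real checkpoints are somewhat larger than these figures because each quantization group also stores a scale.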

A plausible implication is that SignRoundV2 enables practical PTQ of large-scale LLMs on commodity hardware by combining first-order sensitivity estimation with a computationally inexpensive tuning pipeline.

6. Significance and Future Directions

SignRoundV2 establishes two major contributions for low-bit LLM quantization: (1) a scalable, gradient-informed sensitivity metric (DeltaLoss) that guides allocation, and (2) efficient pre-tuning that substantially improves scale initialization, both of which lead to robust quantization even in extremely low-bit regimes. Its methodology generalizes to a range of LLM architectures and could be further extended by investigation into alternative sensitivity metrics, broader calibration strategies, or adaptation to even lower resource targets (Cheng et al., 4 Dec 2025).


References (1)

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs (2025)
