One-Shot Magnitude Pruning

Updated 9 February 2026

One-shot magnitude pruning is a model compression technique that ranks weights by their absolute values and prunes those below a global threshold after full training.
Multi-rate and sensitivity-aware variants enhance this method by dynamically adjusting thresholds and incorporating gradient information for improved sparsity and accuracy.
Empirical results show that one-shot MP achieves competitive accuracy at high sparsity levels with reduced computational overhead compared to iterative pruning methods.

One-shot magnitude pruning (MP) is a class of model compression techniques in which a neural network, once fully trained, is subjected to a single pass of parameter elimination based purely on the magnitude of its weights. The essential procedure involves ranking all parameters (or structural units, e.g., channels or filters) by an absolute-value importance score, selecting a global pruning threshold to induce a specified sparsity, and then zeroing out those weights below this threshold. In contrast to iterative pruning—which alternates between weight removal and retraining—one-shot MP executes a solitary global thresholding and, optionally, a final fine-tuning phase. This approach, including recent multi-rate and sensitivity-aware variants, is widely adopted across convolutional, recurrent, and graph neural networks, offering a computationally efficient pathway to substantial model sparsity with frequently strong accuracy retention.

1. Mathematical Formulation of One-Shot Magnitude Pruning

The canonical one-shot MP recipe assigns an importance score $I(w_i) = |w_i|$ to each parameter $w_i$ of a network trained to convergence. Given a desired sparsity $s \in (0,1)$ , all weights are globally ranked by $|w_i|$ and a threshold $\theta$ is determined as the $s$ -quantile:

$\theta = \min\left\{ t : \left|\{ i : |w_i| < t \}\right| \ge s \cdot N \right\}$

where $N$ is the total number of parameters. The pruned network is formed by applying the binary mask $m_i = 1[|w_i| \geq \theta]$ to all $w_i$ , yielding pruned weights $w^*_i = w_i m_i$ (Gupta et al., 2022, Janusz et al., 19 Aug 2025).

In structured pruning, the same principle applies to units such as filters or channels, where the magnitude score becomes $\|W_i\|_2$ (the $\ell_2$ norm of unit $i$ ). The resulting empirically sparse network is then ready for optional fine-tuning with the same mask reapplied throughout (Janusz et al., 19 Aug 2025).

2. Variants and Extensions: Multi-Rate, Sensitivity, Uncertainty

Multi-Rate One-Shot Pruning

Recent developments extend one-shot MP to support multi-rate sparsity extraction. For example, multi-rate magnitude pruning (MRMP) defines a continuous, smooth relaxation of the binary mask using a band-stop function,

$m_{ij} := \psi_{a,\sigma}(\hat{w}_{ij}) = (1 + \sigma\, \exp(a^2 - \hat{w}_{ij}^2))^{-1}$

where $a$ parameterizes the threshold corresponding to the chosen sparsity under a prescribed prior $P$ for the weight distribution (e.g., Gaussian or Laplace). MRMP enforces distributional alignment $D_{KL}(P \| Q)$ , allowing extraction of masked subnetworks at arbitrary sparsity post-training—either at pre-specified rates or via extrapolation—without retraining. Once the network is trained, any desired sparsity is realized by thresholding the latent weights according to the prior’s quantile function $a^* = F_P^{-1}(r^*)$ (Sahbi, 2023).

Sensitivity-Aware and Regularization Perspectives

Sensitivity-aware one-shot pruning augments the magnitude ranking with diagnostic scores that incorporate gradient and Fisher information, defining a composite per-weight sensitivity $\Sigma_i$ and a reweighted score:

$\text{score}_i = \frac{|w_i|}{1 + \lambda \Sigma_i}$

where $\Sigma_i$ combines normalized gradient and Fisher estimates, and $\lambda$ modulates sensitivity emphasis. This approach targets more aggressive pruning in redundancy-rich components (e.g., decoder self-attention in ASR) and empirically yields both regularization benefits and higher compression rates (Irigoyen et al., 11 Nov 2025).

One-shot MP also serves as a deterministic capacity-control regularizer, effectively imposing an $\ell_0$ constraint that improves generalization in several domains, notably when pruning is allocated based on component-level sensitivities (Irigoyen et al., 11 Nov 2025).

Statistical Guarantees and Uncertainty Quantification

Extensions based on uncertainty quantification, such as the Learn-then-Test (LTT) framework, enable distribution-free statistical calibration of the pruning ratio. By controlling the family-wise error rate (FWER) via super-uniform $p$ -values for each candidate sparsity, one-shot MP attains rigorous guarantees on maximum allowed loss, supporting both labeled and unlabeled settings. The highest empirically validated $\lambda$ at which $R(\lambda)$ stays within user-tolerated degradation $\alpha$ is adopted (Alvarez, 2024).

3. Practical Algorithms and Pseudocode

The core workflow for one-shot MP can be summarized as follows (Gupta et al., 2022, Janusz et al., 19 Aug 2025):

Train the network to full convergence with standard (or variance-regularized) loss.
Rank all weights (or structural units) by $|w_i|$ (or $\|W_i\|_2$ ).
Determine the global (or per-component) threshold for the target sparsity, optionally enforcing an additional minimal per-layer keep (Minimum Threshold, MT).
Mask/prune all weights below the threshold.
Optionally, fine-tune the resulting network with a fixed mask.

A minimal pseudocode for unstructured MP is:

inputs: W (weights), s (target sparsity), E (fine-tune epochs)

theta = np.quantile(np.abs(W), s)
mask = (np.abs(W) >= theta).astype(float)
W_pruned = W * mask

for epoch in range(E):
    # gradient update on W_pruned using fixed mask
    ...

(Gupta et al., 2022, Janusz et al., 19 Aug 2025).

For multi-rate MRMP, a single training produces a latent weight tensor; any desired pruning fraction is extracted by (i) thresholding at $a^* = F_P^{-1}(r^*)$ , (ii) forming a hard mask, and (iii) outputting sparse weights $\hat{w} \odot m$ (Sahbi, 2023).

4. Empirical Performance and Analysis

Extensive benchmarking demonstrates that one-shot MP matches or outperforms more complex state-of-the-art (SOTA) pruning methods at moderate-to-high sparsity:

On CIFAR-10 with WRN-28-8 at 90% sparsity, one-shot global MP achieves 96.30% test accuracy, surpassing all SOTA comparators; similar trends hold at 95% sparsity (Gupta et al., 2022).
On ImageNet, ResNet-50 at 90% sparsity: one-shot MP yields 75.28% top-1 accuracy; at 95%, 71.56% (slightly lower than gradual MP's 72.14%, but above STR baseline at 70.4%) (Gupta et al., 2022).
Fine-grained sensitivity-aware pruning enables aggressive compression in ASR without fine-tuning: e.g., pruning 50% of decoder self-attention in Whisper-small reduces WER by 2.38% absolute (20.44% relative) on LibriSpeech test-other (Irigoyen et al., 11 Nov 2025).
In GNNs, one-shot MP followed by denoising and repowering attains 1.3%–45.6% higher weight sparsity and 7.5%–22.7% higher graph sparsity than IMP-based baselines, with $1.7$– $44\times$ speedup and $95.3\%$ – $98.6\%$ MAC savings (Yue et al., 2024).
Calibrated statistical pruning, as on MNIST, supports $>70\%$ sparsity at $\leq 3\%$ error degradation, with rigorous FWER control (Alvarez, 2024).

A composite table summarizing select results:

Model/Dataset	Sparsity	Accuracy (MP)	Baseline/SOTA	Reference
WRN-28-8 (CIFAR-10)	90%	96.30%	96.08% (DPF)	(Gupta et al., 2022)
ResNet-50 (ImageNet)	95%	71.56%	70.40% (STR)	(Gupta et al., 2022)
Whisper-small (LibriSpeech; Dec SA)	50%	9.26 (WER)	11.64 (unpruned)	(Irigoyen et al., 11 Nov 2025)
GCN/GIN/GAT (GLT)	1.3–45%	$\uparrow$	IMP approaches	(Yue et al., 2024)
MNIST (FC)	70–78%	≤3–5% err	Statistical cert.	(Alvarez, 2024)

5. Theoretical Perspectives and Limitations

One-shot MP’s practical success is grounded in the empirical observation that magnitude is a sufficient proxy for importance in most over-parameterized architectures. For high sparsity, catastrophic performance drops may be caused not by suboptimal weight ranking, but by collateral effects such as "signal collapse"—the loss of activation variance across layers leading to poor information propagation. This is especially pronounced in very deep networks at aggressive sparsity levels (Saikumar et al., 18 Feb 2025). The REFLOW method addresses this bottleneck by recalibrating batch normalization (BN) statistics after MP: recalibration of the BN running mean and variance restores activation variance and enables dramatic accuracy recoveries (e.g., raising ResNeXt-101 top-1 accuracy from $<4.1\%$ to $78.9\%$ at 20% density) without updating the weights themselves (Saikumar et al., 18 Feb 2025).

The performance of magnitude-based pruning is also enhanced by training-time modifications that restructure the weight distribution, such as the Weight Variance Amplifying Regularizer (VAR) (Yun et al., 18 Nov 2025). Such regularization increases layerwise parameter variance, ensuring a clearer separation between near-zero and large weights, yielding superior post-prune accuracy, particularly in very high-sparsity regimes.

6. Comparative Assessment: One-Shot vs. Iterative Pruning

Systematic comparisons reveal that one-shot MP is preferable when compression ratios are moderate (sparsity $\leq80\%$ ): it is computationally efficient, algorithmically simple, and achieves accuracy competitive with more sophisticated iterative or second-order schemes (Gupta et al., 2022, Janusz et al., 19 Aug 2025). At extreme sparsities ( $>80$ – $90\%$ ), iterative geometric pruning or a hybrid "few-shot" approach—one-shot to reach bulk sparsity, followed by gentle iterative steps—may reclaim additional accuracy. For practical deployment, one-shot MP remains the method of choice for most settings; careful implementation (e.g., minimum per-layer thresholds and post-prune normalization) further extends its robustness (Gupta et al., 2022, Janusz et al., 19 Aug 2025).

7. Implementation, Limitations, and Best Practices

Implementation of one-shot MP is trivial in major deep learning frameworks, relying only on standard sorting/indexing operations and—optionally—incorporation of per-layer minimum thresholds or sensitivity-augmented scores. The method is model-agnostic and applies to convolutional, recurrent, transformer, and GNN architectures. Sensitivity-aware and MRMP-based variants require precomputation of sensitivity or distributional alignment during or after the main training phase (Sahbi, 2023, Irigoyen et al., 11 Nov 2025).

Best practices include:

Using global magnitude ranking for unstructured MP, with per-layer quotas for structured pruning where necessary to prevent layer collapse (Gupta et al., 2022, Janusz et al., 19 Aug 2025).
Applying batch normalization recalibration (REFLOW or equivalent) to avoid signal collapse in high-sparsity settings (Saikumar et al., 18 Feb 2025).
For uncertainty-sensitive applications, calibrating the maximal safe pruning rate using distribution-free FWER control (Alvarez, 2024).
Utilizing variance-regularized training (VAR) or multi-rate strategies (MRMP) for networks deployed at multiple or extreme sparsity levels (Yun et al., 18 Nov 2025, Sahbi, 2023).
For domain-specific regularization gains, leveraging component- or sensitivity-guided MP, especially where pruning uncovers underlying redundancies (e.g., self-attention in speech recognition) (Irigoyen et al., 11 Nov 2025).

One-shot magnitude pruning remains a central, high-impact methodology for neural network compression, striking a favorable balance between simplicity, theoretical interpretability, practical effectiveness, and extensibility to advanced settings such as multi-rate and sensitivity-aware pruning (Gupta et al., 2022, Sahbi, 2023, Irigoyen et al., 11 Nov 2025, Yun et al., 18 Nov 2025, Alvarez, 2024, Yue et al., 2024, Saikumar et al., 18 Feb 2025, Janusz et al., 19 Aug 2025).