Learned Slimmable Training Methods
- Learned Slimmable Training is a dynamic method that creates a single neural network capable of operating at various channel widths with shared weights and switchable batch normalization.
- The approach employs advanced sampling techniques and in-place distillation to align sub-network performances with individually trained models.
- Extensions include pruned networks, transformer adaptations, and dynamic gating to enable real-time adjustments across diverse tasks and hardware constraints.
A learned slimmable training methodology constructs a single neural network capable of dynamically executing at varying channel widths (and, in advanced forms, depths or expert counts), such that each sub-network’s accuracy reliably approximates (or exceeds) that of an individually trained network at the corresponding capacity. The central innovation is the learning protocol that permits all sub-networks to share weights while preserving the fidelity of batch normalization statistics and aligning feature or output distributions via explicit mechanisms during joint optimization. This approach enables real-time accuracy–efficiency trade-offs across diverse deployment scenarios without retraining or architecture search.
1. Fundamental Mechanism of Slimmable Networks
The generic formulation of a slimmable network comprises a shared set of convolutional, linear, and other operator parameters, with selectable sub-network configurations indexed by width multipliers w ∈ W (e.g., W = {0.25, 0.5, 0.75, 1.0}). For each width w, each layer l activates only the first ⌈w · C_l⌉ channels (for layer l with full-width C_l channels). All non-normalization weights are shared across widths. The core structural element enabling correctness is switchable batch normalization (S-BN): each width w has its own set of batch-norm statistics and affine parameters (μ_{w,c}, σ²_{w,c}, γ_{w,c}, β_{w,c}) for each channel c, with all other weights shared. At inference, the model simply selects w, loads the corresponding BN parameters, and operates at a reduced memory/FLOPs footprint (Yu et al., 2018).
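The mechanism can be illustrated with a minimal PyTorch sketch of a slimmable convolution and switchable batch normalization. This is an assumption-laden illustration, not the reference implementation; the class names, the fixed width set `WIDTH_MULTS`, and the `width_idx`/`width_mult` attributes are all hypothetical, and the conv sketch assumes `groups=1`.

```python
import torch
import torch.nn as nn

WIDTH_MULTS = [0.25, 0.5, 0.75, 1.0]  # assumed switch set

class SwitchableBatchNorm2d(nn.Module):
    """One private BatchNorm2d per width; all other weights are shared."""
    def __init__(self, max_channels):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(max(1, int(w * max_channels))) for w in WIDTH_MULTS]
        )
        self.width_idx = len(WIDTH_MULTS) - 1  # default: full width

    def forward(self, x):
        # Route through the private BN of the currently selected width.
        return self.bns[self.width_idx](x)

class SlimmableConv2d(nn.Conv2d):
    """Shared conv weights; only the first ceil(w*C) filters are used."""
    def __init__(self, in_ch, out_ch, k, **kw):
        super().__init__(in_ch, out_ch, k, **kw)
        self.width_mult = 1.0

    def forward(self, x):
        out_ch = max(1, int(self.width_mult * self.out_channels))
        in_ch = x.shape[1]  # input was already slimmed by the previous layer
        weight = self.weight[:out_ch, :in_ch]   # slice shared weights
        bias = self.bias[:out_ch] if self.bias is not None else None
        return nn.functional.conv2d(x, weight, bias, self.stride,
                                    self.padding, self.dilation)
```

Slicing the leading channels (rather than an arbitrary subset) is what lets all widths share one contiguous weight tensor.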
2. Learned Slimmable Training Algorithm
The canonical learned slimmable training process for classification proceeds as:
- Outer iteration: For each mini-batch (x, y), iterate over all width multipliers w ∈ W (or, for US-Nets, a sampled subset).
- Width switch: For each w, activate only the first ⌈w · C_l⌉ channels per layer l.
- BN privatization: Switch BN to the private parameters for w (S-BN).
- Forward and backward: Compute output ŷ_w, loss L_w = L(ŷ_w, y), and accumulate gradients over all w ∈ W.
- Update: Take a single optimizer step for all shared weights and per-width BN parameters.
The total training objective is the unweighted sum (or, in US-Nets or SlimCLR, possibly weighted by width or FLOPs) of switch-specific losses:

L_total = Σ_{w ∈ W} L(f(x; θ_shared, BN_w), y).
This approach is empirically stable and maintains per-width performance parity or advantage relative to independently trained baselines. In-place distillation (Section 4) and specific sampling strategies further enhance the training regime [(Yu et al., 2018), (Yu et al., 2019), (Zhao et al., 2022)].
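The per-batch loop above can be sketched as a single joint-update step in PyTorch. This is a schematic, assuming a `model.set_width(w)` helper (hypothetical) that switches both the active channels and the private S-BN parameters together:

```python
import torch

def slimmable_train_step(model, optimizer, criterion, x, y,
                         width_mults=(0.25, 0.5, 0.75, 1.0)):
    """Accumulate gradients from every width, then take one optimizer step."""
    optimizer.zero_grad()
    losses = {}
    for w in width_mults:
        model.set_width(w)          # activate first ceil(w*C) channels + BN_w
        loss = criterion(model(x), y)
        loss.backward()             # gradients accumulate across widths
        losses[w] = loss.item()
    optimizer.step()                # one update for shared weights + all BN_w
    return losses
```

Because `backward()` is called once per width before the single `step()`, the shared weights receive the summed gradient of all switch-specific losses, matching the objective above.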
3. Advanced Sampling, In-Place Distillation, and Training Techniques
Universally Slimmable Networks (US-Nets) introduce three key advances (Yu et al., 2019):
- Sandwich Rule: For each mini-batch, always include the minimal width w_min, the maximal width w_max, plus n randomly sampled intermediate widths, to cover the whole interval [w_min, w_max]. This ensures the hardest-to-optimize endpoints are always addressed and intermediate performance is bounded between them.
- In-Place Distillation: After computing outputs for the widest network w_max, treat its logits with stop-gradient as soft targets for the remaining sub-nets (with temperature scaling). Distillation losses of the form L_KD(ŷ_w, sg(ŷ_{w_max})) are summed with the standard targets, improving sub-network accuracy by up to 1% top-1.
- BatchNorm Calibration: Rather than storing all possible BN statistics for a continuum of widths, post-training calibration is run per width to accumulate accurate statistics over a small calibration set, yielding negligible error (<0.1%) versus true batch statistics.
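The sandwich rule and in-place distillation combine naturally in one training step. A minimal sketch, assuming the same hypothetical `model.set_width` helper as above and illustrative hyperparameters (no temperature scaling shown):

```python
import random
import torch
import torch.nn.functional as F

def sandwich_step(model, optimizer, x, y, w_min=0.25, w_max=1.0, n_rand=2):
    optimizer.zero_grad()
    # 1) The widest network trains on hard labels and supplies soft targets.
    model.set_width(w_max)
    logits_max = model(x)
    F.cross_entropy(logits_max, y).backward()
    soft = logits_max.detach().softmax(dim=1)   # stop-gradient teacher
    # 2) Smallest width + random intermediates distill from the teacher.
    widths = [w_min] + [random.uniform(w_min, w_max) for _ in range(n_rand)]
    for w in widths:
        model.set_width(w)
        logp = F.log_softmax(model(x), dim=1)
        F.kl_div(logp, soft, reduction="batchmean").backward()
    optimizer.step()
```

The `detach()` implements the stop-gradient: the teacher's parameters are updated only by its own hard-label loss, never by the distillation terms.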
Variants such as SlimCLR adapt self-supervised contrastive frameworks by introducing slow-start schedules (initial training on the full width only), online distillation from full to slim sub-nets, and switchable linear probes for downstream evaluation, to counteract gradient magnitude imbalance and direction divergence (Zhao et al., 2022). DSPNet and learned slimmable semantic-segmentation/detection methods further tailor the knowledge-transfer mechanism and architecture masking to the task [(Wang et al., 2022), (Xue et al., 2022)].
4. Extensions: Pruned, Transformer, MoE, and Dynamic Slimmable Networks
Learned slimmable training has diversified into multiple architectural and task domains:
- Pruned Slimmable Networks (SP-Net): Rather than uniform width multipliers, per-layer non-uniform pruning is learned (using multi-base, one-shot pruned sub-nets, slimmable channel sorting, and zero padding match for residuals), achieving superior performance/FLOPs trade-offs compared to both slimmable and SOTA pruning/NAS methods (Kuratsu et al., 2022).
- Bilaterally Slimmable Transformers (BST): Transformer-based models become elastic along both width (hidden size, head count) and depth via a “slim-all” and “slim-middle” strategy, with a triangle-filtered submodel configuration set and a per-batch multi-submodel KL-divergence distillation loss (Yu et al., 2022).
- Self-slimmable Sparse MoEs: By freezing a random router, increasing the top-k expert count during training, and omitting router learning and load-balancing losses, sparse Mixture-of-Experts models gain slimmable properties, enabling seamless inference-time trade-offs and outperforming dense or standard MoEs on several benchmarks (Chen et al., 2023).
- Dynamic Slimmable Networks: Learned gates are trained (often after progressive sub-net search) to route data samples to optimized sub-networks given input complexity, maximizing performance per FLOP dynamically (Jiang et al., 2021).
- Self-supervised Universally Slimmable Learning: Gradient instability in naïve SSL slimmable recipes is addressed by using relative-distance losses (InfoNCE or cross-entropy form) for both the base and distillation objectives, and momentum teachers to stabilize targets. Dynamic sampling and group regularization further improve learning efficiency and late-channel utilization (Cao et al., 2023).
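The dynamic-routing idea behind dynamic slimmable networks can be illustrated with a tiny gate that predicts a width index per input, so that easy samples can route to cheaper sub-networks. The module name, feature dimension, and hard `argmax` selection are assumptions for illustration, not the DS-Net implementation (which uses differentiable training of the gate):

```python
import torch
import torch.nn as nn

class WidthGate(nn.Module):
    """Predicts, per sample, which width switch to execute."""
    def __init__(self, feat_dim, num_widths):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_widths)

    def forward(self, feats):
        # feats: (B, feat_dim) pooled features from an early, cheap stage
        return self.head(feats).argmax(dim=1)   # (B,) chosen width indices

gate = WidthGate(feat_dim=16, num_widths=4)
idx = gate(torch.randn(8, 16))  # one width index per sample
```

At inference, each sample's index selects the sub-network (and its private BN parameters) for the rest of the forward pass.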
5. Empirical Results and Performance Trade-offs
Slimmable networks and their extensions routinely match or exceed the accuracy of individually trained models for each width across benchmarks:
| Model Variants | Top-1 Error ∆ vs. Individual | FLOPs Reduction | Training Overhead |
|---|---|---|---|
| S-Net (MobileNet) | Comparable or better | Up to 4× | 1× (naive); 2× (SP-Net) |
| US-Net | <1% better at small widths | Any (continuous) | ≈1× (plus calibration) |
| SP-Net | +1–4% vs. S-Net, ≈ NAS | >2× | 2× S-Net |
| BST (MCAN) | ≤0.6 pp loss at 3.7× fewer FLOPs | Up to 6× | Single unified run |
| SMoE-Dropout | +0.5–5% accuracy over dense | Linear in active experts k | 21–37% less training time |
Empirical evidence shows that slimmable training imbues implicit knowledge distillation from large to small sub-networks, contributing to stable rank order and accurate representations for all supported capacities [(Yu et al., 2018), (Yu et al., 2019), (Kuratsu et al., 2022), (Yu et al., 2022), (Chen et al., 2023)].
6. Open Challenges and Design Considerations
- Gradient Interference: Sub-network gradients can conflict, especially in self-supervised training, requiring loss reweighting, slow-start schedules, or dedicated matching objectives [(Zhao et al., 2022), (Cao et al., 2023)].
- Feature Distribution Mismatch: BN privatization is not scalable to an uncountable set of widths, motivating calibration or alternate normalization strategies (Yu et al., 2019).
- Implementation Complexity: Pruned slimmable variants (SP-Net) and dynamic gating require channel sorting, mask management, and careful maintenance of memory-contiguous accesses to preserve inference speed [(Kuratsu et al., 2022), (Jiang et al., 2021)].
- Task-specific Adaptation: Semantic segmentation and detection tasks benefit from auxiliary distillation (stepwise downward KD) and boundary-aware objectives to compensate for the increased difficulty of transferring knowledge to lower-capacity submodels (Xue et al., 2022).
7. Significance and Application Domains
Learned slimmable training has enabled practical deployment of adaptive neural networks across mobile, embedded, and server environments, with single-checkpoint models spanning a wide dynamic range of compute/accuracy trade-offs. The framework has extended from image classification to super-resolution, object detection, semantic segmentation, visual question answering, and language modeling with sparse MoEs, as well as to self-supervised and dynamic routing paradigms [(Yu et al., 2018), (Yu et al., 2019), (Chen et al., 2023), (Yu et al., 2022), (Zhao et al., 2022), (Xue et al., 2022), (Cao et al., 2023)].
A plausible implication is continual broadening of slimmable/elastic training protocols to more modalities and tasks, further reducing training time, storage, and deployment barriers associated with model selection for varying hardware constraints, and enabling more sophisticated forms of dynamic inference.