Strong Lottery Ticket Hypothesis
- The Strong Lottery Ticket Hypothesis asserts that every sufficiently large, randomly initialized network contains a sparse subnetwork able to approximate a target network without any modification of its weights.
- Its proofs leverage deep probabilistic combinatorics, in particular the random subset sum problem, to provide explicit bounds on overparameterization and sparsity via entry-wise pruning.
- Extensions to transformer, quantized, and quantum architectures underscore SLTH's practical impact despite inherent NP-hard challenges in optimal subnetwork discovery.
The Strong Lottery Ticket Hypothesis (SLTH) formalizes the claim that every sufficiently large, randomly initialized neural network contains a sparse subnetwork which, without any weight updates or additional training, can approximate the function of an arbitrary, much smaller target network to high accuracy. Originating as a theoretical strengthening of the empirical Lottery Ticket Hypothesis, the SLTH has been rigorously established across fully connected, convolutional, equivariant, quantized, transformer, and even quantum circuit architectures. Core technical results rely on deep probabilistic combinatorics—especially the random subset sum problem and its restricted variants—to show that with high probability, appropriate pruning can extract “winning tickets” from a large random network, even prior to any supervised optimization. A central theme in the SLTH literature is quantifying the necessary degree of over-parameterization, sparsity of extracted subnetworks, and the optimality of these trade-offs.
1. Formal Framework and Main Theorems
Let $F$ be a depth-$l$ ReLU network of width $d$ with layerwise spectral-norm and entrywise weight bounds. For any approximation error $\epsilon > 0$ and failure probability $\delta > 0$, consider a depth-$2l$ network $G$ of width $\mathrm{poly}(d, l, 1/\epsilon, \log(1/\delta))$ with i.i.d.\ weights. The foundational result (Malach et al., 2020) establishes that, with probability at least $1-\delta$, there exists a weight subnetwork $\tilde{G}$ (formed by masking $G$'s weights) such that
$$\sup_{\|x\| \le 1} \big\| F(x) - \tilde{G}(x) \big\| \le \epsilon,$$
and the number of nonzeros in $\tilde{G}$ is polynomial in $d$, $l$, and $1/\epsilon$. The pruning is entry-wise, not limited to whole neurons. Notably, no training whatsoever is required; the subnetwork is found purely by pruning.
This existential result holds for general depth-$l$ ReLU nets, and the bounds on overparameterization and the size of the subnetwork are explicit and polynomial in relevant parameters. The proof iterates the “two-layers-for-one” construction (see Lemmas 3.1–3.4 in (Malach et al., 2020)) which assembles approximations for each unit and each layer using the subset-sum principle, then stacks the approximators to cover the full network.
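The two-layers-for-one step can be illustrated end to end on the simplest possible target, the scalar map $x \mapsto wx$. The sketch below is a minimal illustration with hypothetical widths and a brute-force subset-sum search (the paper's construction is more refined): a frozen random two-layer ReLU block is pruned entry-wise so that the surviving path products $v_i u_i$ on each sign of the input sum to approximately $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_subset(pool, target):
    """Brute-force subset sum: 0/1 mask over `pool` whose sum is closest to `target`."""
    n = len(pool)
    bits = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)
    sums = bits @ pool
    return bits[int(np.argmin(np.abs(sums - target)))]

w_target = 0.53              # the "target network" here is just x -> w_target * x
n_hidden = 32                # width of the random two-layer ReLU block
u = rng.uniform(-1, 1, n_hidden)     # random first-layer weights (frozen)
v = rng.uniform(-1, 1, n_hidden)     # random second-layer weights (frozen)

# For x >= 0 only units with u_i > 0 are active; for x <= 0 only those with u_i < 0.
# Prune each group so that the surviving path products v_i * u_i sum to ~ w_target.
mask = np.zeros(n_hidden)
for side in (u > 0, u < 0):
    idx = np.where(side)[0][:14]     # cap the pool so brute force stays cheap
    mask[idx] = best_subset(v[idx] * u[idx], w_target)

def pruned_net(x):
    return np.sum(mask * v * np.maximum(u * x, 0.0))

xs = np.linspace(-1, 1, 201)
err = max(abs(pruned_net(x) - w_target * x) for x in xs)
print(f"max |subnet(x) - w*x| on [-1, 1]: {err:.4f}")
```

Stacking such blocks, one per target weight and layer, is what yields the depth-$2l$ construction.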
2. Pruning, Sparsity, and Subnetwork Construction
The SLTH specifies weight subnetworks formed via binary masks applied entrywise to each weight matrix. The key concept is that, by appropriate selection of the mask, even a random, highly overparameterized network already contains all the necessary degrees of freedom to uniformly approximate any bounded-weight target network.
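A minimal illustration of the masking convention (shapes and density are arbitrary choices for the example): entry-wise masks keep or drop individual weights, whereas neuron-level pruning would zero entire rows at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen random weights of an overparameterized two-layer net.
W1 = rng.normal(size=(64, 8))
W2 = rng.normal(size=(1, 64))

# Entry-wise binary masks at ~10% density: each weight kept or dropped individually.
M1 = (rng.random(W1.shape) < 0.1).astype(float)
M2 = (rng.random(W2.shape) < 0.1).astype(float)

def subnet(x):
    h = np.maximum((M1 * W1) @ x, 0.0)   # masked weights, never retrained
    return (M2 * W2) @ h

# Neuron-level pruning, by contrast, would zero entire rows of W1 at once:
neuron_mask = (rng.random((64, 1)) < 0.1).astype(float)
W1_neuron_pruned = neuron_mask * W1

x = rng.normal(size=8)
print(subnet(x))
```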
Quantitative bounds on sparsity for strong tickets are derived from advances in the Random Fixed-Size Subset Sum (RFSS) problem (Natale et al., 2024). For an $l$-layer target network, the width per layer of the random network required to prune down to density $p$ and still achieve $\epsilon$-approximation scales as
$$O\!\left( \frac{\log\!\big(n_F / (\epsilon\,\delta)\big)}{H(p)} \right),$$
where $n_F$ is the target network’s parameter count and $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function. Consequently, at moderate sparsity, both overparameterization and subnetwork size grow only polylogarithmically in $1/\epsilon$ and $1/\delta$ (Natale et al., 2024). Explicit bounds for G-equivariant networks, multi-head attention, and transformers are analogous in form, often adding minor polynomial factors in architecture-specific parameters (Otsuka et al., 6 Nov 2025).
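The combinatorics behind the RFSS bound can be checked numerically: with $n$ samples and subsets of fixed size $k$, there are $\binom{n}{k} \approx 2^{nH(k/n)}$ candidate sums. The parameters below ($n = 16$, $k = 4$, so density $p = 1/4$) are illustrative only.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

n, k = 16, 4                          # n random samples, subsets of exactly size k
samples = rng.uniform(-1, 1, n)

# All C(16, 4) = 1820 fixed-size subset sums; log C(n, pn) ~ n * H(p) with p = k/n.
sums = np.array([samples[list(c)].sum() for c in combinations(range(n), k)])

targets = np.linspace(-0.5, 0.5, 101)
worst = max(np.min(np.abs(sums - t)) for t in targets)
print(f"worst approximation error over targets in [-0.5, 0.5]: {worst:.4f}")
```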
3. Extensions: ε-Perturbations, Quantization, and Architectural Generalization
A robust line of research investigates allowing the initial random weights to move within an $\eta$-scale perturbation ball before pruning, blurring the line between “train-free” and training-based lottery tickets (Xiong et al., 2022). The main result shows that overparameterization requirements diminish smoothly as $\eta$ increases: the required width per target parameter decreases with the perturbation radius $\eta$ for any fixed error $\epsilon$. Empirically, small amounts of projected SGD before pruning already achieve this regime, reducing the burden on architectural size while retaining strong ticket guarantees.
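A minimal sketch of the perturbed regime, on a toy least-squares objective with an arbitrary radius (the cited work's exact setting and norms may differ): weights take projected gradient steps but remain confined to an $\ell_\infty$ ball of radius $\eta$ around their initialization, after which pruning would proceed as usual.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy least-squares problem; eta is a hypothetical per-weight perturbation radius.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true

w0 = rng.normal(size=10)                 # random initialization
w = w0.copy()
eta, lr = 0.3, 0.01

loss0 = np.mean((X @ w0 - y) ** 2)
for _ in range(500):
    grad = X.T @ (X @ w - y) * (2 / len(X))
    w -= lr * grad
    w = np.clip(w, w0 - eta, w0 + eta)   # project back into the l_inf eta-ball
loss = np.mean((X @ w - y) ** 2)
print(f"loss: {loss0:.3f} -> {loss:.3f}; max |w - w0| = {np.abs(w - w0).max():.3f}")
```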
Quantized settings are addressed in (Kumar et al., 14 Aug 2025), which establishes that for $b$-bit quantized target networks, a random $b$-bit quantized network that is logarithmically wider and constant-factor deeper suffices to recover any discrete target network exactly via pruning and quantization, with information-theoretic optimality up to constants.
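The exact-recovery phenomenon is easy to reproduce on an integer grid (bit-width and pool size below are illustrative): pruning corresponds to choosing a subset of the random quantized weights, and on a discrete grid subset sums hit targets exactly rather than approximately.

```python
import numpy as np

rng = np.random.default_rng(4)

b = 6                                   # illustrative bit-width of the weight grid
n = 40                                  # random quantized weights available to prune
pool = rng.integers(-2**b, 2**b + 1, size=n)

# Dynamic programming over achievable subset sums (pruning = choosing a subset).
achievable = {0}
for p in pool:
    achievable |= {s + int(p) for s in achievable}

targets = list(range(-2**b, 2**b + 1))  # every target on the same b-bit grid
hit = sum(t in achievable for t in targets)
print(f"exactly representable targets: {hit}/{len(targets)}")
```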
Architectural generalization includes:
- Transformers: The SLTH is shown to hold for multi-head attention (MHA) provided the hidden key/value dimensions are overparameterized by a logarithmic factor relative to the target. Extensions to full transformers (without normalization) aggregate per-block errors, maintaining overall control with only log-overparameterization (Otsuka et al., 6 Nov 2025).
- Equivariant and convolutional networks: Generalizations leverage equivariant basis expansions, yielding similar sparsity and overparameterization trade-offs (Natale et al., 2024).
- Variational quantum circuits: The SLTH is extended to binary VQCs, where combinatorial mask search (e.g., with evolutionary algorithms) discovers random subcircuits achieving perfect task performance without parameter tuning (Kölle et al., 14 Sep 2025).
4. Algorithmic Methods and Computational Complexity
Although all foundational theorems are existential and non-constructive—they guarantee the existence but not efficient identification of strong lottery tickets—recent work has made progress on algorithmic approaches. Brute-force and greedy heuristics are intractable for large models due to the NP-hardness of the combinatorial subnetwork search (Malach et al., 2020). Gradient-based pruning methods (e.g., edge-popup), randomized masking, and more recently, genetic algorithms and evolutionary search have been benchmarked (Altmann et al., 2024, Kölle et al., 14 Sep 2025). On small binary tasks, pure genetic algorithms—without any gradient signal—can find strong tickets matching or occasionally exceeding the performance of gradient-optimized subnetworks, supporting the practical relevance of the existential theory (Altmann et al., 2024).
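A toy version of gradient-free mask search: the genetic algorithm below evolves binary masks over a frozen random network (population size, mutation rate, and the single masked layer are illustrative choices, not the cited benchmarks). Elitism guarantees the best loss is non-increasing across generations.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy task: prune a frozen random two-layer ReLU net to fit a 1-D linear target.
X = np.linspace(-1, 1, 64)[:, None]
y = 0.7 * X[:, 0]

W1 = rng.uniform(-1, 1, (32, 1))     # frozen random weights
W2 = rng.uniform(-1, 1, 32)

def loss(mask):
    h = np.maximum(X @ W1.T, 0.0)            # (64, 32) hidden activations
    return np.mean((h @ (mask * W2) - y) ** 2)

pop = (rng.random((40, 32)) < 0.5).astype(float)   # population of binary masks
best_hist = []
for _ in range(60):
    fit = np.array([loss(m) for m in pop])
    pop = pop[np.argsort(fit)]
    best_hist.append(fit.min())
    elite = pop[:8]                                 # elitism: best masks survive
    children = []
    while len(children) < len(pop) - len(elite):
        a, b = elite[rng.integers(8)], elite[rng.integers(8)]
        cut = rng.integers(1, 32)
        child = np.concatenate([a[:cut], b[cut:]])  # one-point crossover
        child = np.where(rng.random(32) < 0.05, 1 - child, child)  # mutation
        children.append(child)
    pop = np.vstack([elite, np.array(children)])

print(f"best mask MSE: {best_hist[0]:.4f} -> {best_hist[-1]:.4f}")
```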
A summary table of algorithmic strategies is given below:
| Method | Requires Gradient? | Reported Performance |
|---|---|---|
| Edge-popup | Yes | Good at moderate sparsity |
| Genetic Algorithm (GA) | No | Near-optimal at low class count |
| Evolutionary Search VQC | No | 100% on binary tasks |
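For concreteness, a stripped-down edge-popup loop on a single linear layer (teacher, learning rate, and sparsity level are illustrative; the original method operates layer-wise in deep networks): weights stay frozen, per-weight scores are trained, the forward pass keeps the top-$k$ scores, and a straight-through estimator routes the gradient of the masked weights to the scores.

```python
import numpy as np

rng = np.random.default_rng(6)

X = rng.normal(size=(256, 20))
w_true = rng.normal(size=20) * (rng.random(20) < 0.5)   # sparse teacher
y = X @ w_true

w = rng.normal(size=20)              # frozen random weights
scores = 0.01 * rng.normal(size=20)  # learnable per-weight scores
k, lr = 10, 0.05
losses = []

for _ in range(300):
    mask = np.zeros(20)
    mask[np.argsort(scores)[-k:]] = 1.0      # top-k scores define the mask
    v = mask * w                             # effective (pruned) weights
    resid = X @ v - y
    losses.append(np.mean(resid ** 2))
    g_v = X.T @ resid * (2 / len(X))         # gradient w.r.t. effective weights
    scores -= lr * g_v * w                   # straight-through: pass grad to scores

print(f"loss: {losses[0]:.3f} -> {min(losses):.3f}")
```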
Algorithmic efficiency and scalability to large-scale networks or multi-class tasks remain open challenges.
5. Parameter-Sign and Basin Connectivity: Toward Transferable Winning Tickets
Recent advances show that the critical configuration is not merely sparsity but the sign pattern (and, in architectures with normalization layers, the norm-layer parameters) of the discovered subnetwork. The “A Winning Sign” (AWS) method demonstrates that, after pruning and sign extraction, applying the signed mask to a fresh initialization gives a subnetwork with nearly identical performance and low linear-mode-connectivity barriers to the original trained ticket (Oh et al., 7 Apr 2025). This finding extends the SLTH: the basin associated with the winning subnetwork becomes robustly transferable across all initializations so long as the sign configuration is preserved, even as the weight magnitudes are randomized. AWS achieves this transferability by randomizing norm-layer parameters during pruning, lowering loss barriers along linear paths in weight space. Empirical accuracy of the transferred signed tickets on ImageNet and CIFAR-100 remains nearly on par with dense baselines at high sparsity rates.
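The sign-transfer step itself is simple; the sketch below uses stand-in random matrices rather than an actually trained ticket, and omits the norm-layer randomization that AWS also performs.

```python
import numpy as np

rng = np.random.default_rng(7)

# AWS-style transfer (sketch): reuse the pruned ticket's sparsity and sign
# pattern, but take the weight magnitudes from a completely fresh initialization.
W_trained = rng.normal(size=(16, 16))                 # stands in for a trained ticket
mask = (rng.random((16, 16)) < 0.2).astype(float)     # ticket's sparsity pattern
signs = np.sign(W_trained) * mask                     # extracted sign configuration

W_fresh = rng.normal(size=(16, 16))                   # new random initialization
W_transfer = signs * np.abs(W_fresh)                  # old signs, fresh magnitudes

agreement = np.mean(np.sign(W_transfer[mask == 1]) == np.sign(W_trained[mask == 1]))
print(f"kept weights: {int(mask.sum())}, sign agreement: {agreement:.0%}")
```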
6. Technical Foundations: Random Subset Sum and Bounds
SLTH theory is underpinned by the random subset sum (RSS) and, crucially, the random fixed-size subset sum (RFSS) problem. These probabilistic-combinatorial lemmas quantify how many random samples are necessary to $\epsilon$-cover an interval by subset sums (or fixed-size subset sums), dictating the required width of overparameterized layers to guarantee that arbitrary target weights can be matched (or closely approximated) upon pruning. In quantized settings, the number partitioning phase transition becomes sharp: on the order of $b$ samples allow exact representation of all $b$-bit discrete targets (Kumar et al., 14 Aug 2025).
Proofs for deep networks recursively apply these subset sum lemmas, layering approximations for each unit and each layer, propagating union bounds to handle all neurons and all layers simultaneously. Main theorems (e.g., Theorem 1 in (Malach et al., 2020)) thereby guarantee uniform approximation over the entire input ball and all target network parameters.
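The RSS lemma's exponential flavor, that roughly $\log(1/\epsilon)$ samples suffice for $\epsilon$-coverage, shows up clearly in simulation (sample counts below are illustrative): with $n$ samples there are $2^n$ subset sums, and the worst-case gap to a target shrinks rapidly as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(8)

def worst_error(n, trials=200):
    """Worst subset-sum approximation error over random targets in [-0.5, 0.5]."""
    samples = rng.uniform(-1, 1, n)
    sums = np.zeros(1)
    for s in samples:                 # enumerate all 2^n subset sums incrementally
        sums = np.concatenate([sums, sums + s])
    sums.sort()
    targets = rng.uniform(-0.5, 0.5, trials)
    idx = np.clip(np.searchsorted(sums, targets), 1, len(sums) - 1)
    return max(min(abs(sums[i] - t), abs(sums[i - 1] - t))
               for i, t in zip(idx, targets))

for n in (6, 10, 14, 18):
    print(f"n = {n:2d}: worst error = {worst_error(n):.2e}")
```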
7. Limitations, Open Issues, and Practical Reflections
Key known limitations include:
- Computational hardness: Finding the optimal subnetwork is NP-hard; efficient algorithms for high-dimensional, high-sparsity regimes remain elusive (Malach et al., 2020).
- Overparameterization scale: Polynomial dependence on input dimension (or log dependence in quantized/theoretically optimal settings) may be impractical for ultra-large-scale applications.
- Activation/architecture specificity: Theory is most mature for ReLU and standard initializations, with extensions to other activations, ResNet/convnets, or normalization-dependent architectures still developing (Natale et al., 2024, Oh et al., 7 Apr 2025).
- Extension beyond supervised settings: SLTH’s direct translation to reinforcement learning, unsupervised, or quantum learning remains nascent, though early positive results for VQCs are reported (Kölle et al., 14 Sep 2025).
- Empirical ticket-finding: Algorithmic approaches are robust for small or binary tasks but degrade for high-class counts or larger, more complex models (Altmann et al., 2024, Kölle et al., 14 Sep 2025).
Practical implications include rapid search for compact untrained subnetworks, new strategies for parameter-efficient distributed learning, and the potential to mitigate optimization pathologies (e.g., barren plateaus in VQCs) by preprocessing with structure-respecting pruning. Notably, in the quantized regime, pruning achieves exact recovery (not merely approximation) of all possible target networks of a given class (Kumar et al., 14 Aug 2025). Transferability of signed masks may enable initialization-independent sparse training protocols (Oh et al., 7 Apr 2025).
References:
(Malach et al., 2020, Xiong et al., 2022, Natale et al., 2024, Altmann et al., 2024, Oh et al., 7 Apr 2025, Kumar et al., 14 Aug 2025, Kölle et al., 14 Sep 2025, Otsuka et al., 6 Nov 2025)