
Multiple Ticket Hypothesis

Updated 6 February 2026
  • Multiple Ticket Hypothesis is a framework in neural networks that asserts many distinct sparse subnetworks (tickets) can achieve near-dense performance with minimal overlap.
  • It employs diverse pruning methods—such as global magnitude, layerwise importance, and random masking—to systematically uncover combinatorially many effective tickets.
  • Empirical studies across supervised learning, reinforcement learning, and mechanism design show sparse tickets can mirror full model accuracy, enabling efficient ensembles and reduced computational costs.

The Multiple Ticket Hypothesis (MTH) generalizes the original Lottery Ticket Hypothesis (LTH) for overparameterized neural networks: it asserts that a large model contains not a single but numerous distinct sparse subnetworks (or "tickets"), each capable of attaining performance comparable to the full dense model when trained or fine-tuned appropriately. These multiple tickets can be uncovered via different pruning, importance, or masking criteria, and are typically nearly disjoint except for a small "core" of shared connections. The hypothesis is supported by empirical, theoretical, and algorithmic results across supervised learning, reinforcement learning, and even mechanism design, demonstrating combinatorially large families of performant subnetworks under various selection schemes (Vandersmissen et al., 2023, Adewuyi et al., 2 Feb 2026, Guo, 2023, Kobayashi et al., 2022, Cheng et al., 2022).

1. Origins and Formalization

The Lottery Ticket Hypothesis was originally formulated to state that a randomly-initialized dense neural network contains a subnetwork ("winning ticket") which, when trained in isolation, can match the performance of the original full model. The Multiple Ticket Hypothesis extends this, formalizing that, for a fixed initialization, there are many distinct sparse tickets capable of reaching similar performance, with only minimal overlap in their parameter sets.

Formally, for network parameters $\theta_0 \in \mathbb{R}^d$ and binary mask $m \in \{0,1\}^d$, the subnetwork $\theta_0 \odot m$ is a ticket if, after training (with rewinding or fine-tuning), it obtains accuracy within $\epsilon$ of the full network. MTH states that, at relevant sparsity levels, there exist a combinatorial number of such masks $m_1, m_2, \dots, m_k$, with low pairwise overlap, each yielding a performant subnetwork (Vandersmissen et al., 2023, Adewuyi et al., 2 Feb 2026, Kobayashi et al., 2022).
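In this formalism a ticket is just a binary mask applied elementwise to the initialization, and "low pairwise overlap" can be quantified with a Jaccard index over the kept coordinates. A minimal numpy sketch (the dimension, sparsity level, and threshold below are illustrative assumptions, not values from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                       # total parameter count
theta0 = rng.standard_normal(d)  # dense initialization theta_0

def random_mask(d, sparsity, rng):
    """Binary mask m in {0,1}^d keeping a (1 - sparsity) fraction of weights."""
    k = int(d * (1.0 - sparsity))
    m = np.zeros(d, dtype=np.int8)
    m[rng.choice(d, size=k, replace=False)] = 1
    return m

def jaccard(m1, m2):
    """Pairwise mask overlap: |m1 AND m2| / |m1 OR m2|."""
    inter = np.sum((m1 == 1) & (m2 == 1))
    union = np.sum((m1 == 1) | (m2 == 1))
    return inter / union

m1 = random_mask(d, sparsity=0.99, rng=rng)
m2 = random_mask(d, sparsity=0.99, rng=rng)

ticket = theta0 * m1             # the subnetwork theta_0 ⊙ m
print(jaccard(m1, m2))           # near zero: two random tickets barely overlap
```

Two independently drawn 99%-sparse masks share almost no coordinates, which is what makes a combinatorially large, nearly disjoint ticket family possible in the first place.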

2. Ticket Discovery and Diversity Mechanisms

Multiple approaches have been established for identifying different tickets, largely dependent on how importance or saliency is defined for pruning:

  • Global Magnitude Pruning: Uses $|\theta_{i,j}|$ as an importance measure across all weights (Vandersmissen et al., 2023).
  • Layerwise Normalized Importance: Employs $L_1$-norm, $L_2$-norm, softmax, or min–max scalings for fine-grained, context-aware importance criteria.
  • Random Masking: In RLVR and ensemble settings, purely random sparsity patterns are imposed, producing minimally overlapping tickets that still succeed (Adewuyi et al., 2 Feb 2026, Kobayashi et al., 2022).
  • Iterative Pruning with Regularization: IMP with additional $L_1$ penalties or with mask diversity constraints increases the diversity of discovered tickets.
  • Adaptive “Scratching” and Redraws: In worst-case mechanism design, tickets are drawn repeatedly, each time avoiding previously found “bad” tickets, to systematically explore the ticket space (Guo, 2023).

The discovery process often involves iterative cycles of training, pruning based on the chosen criterion, and optionally rewinding to earlier weights, repeated until the desired sparsity is achieved.
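The train–prune–rewind cycle can be sketched end-to-end on a toy least-squares model (the model, training routine, 20%-per-round prune rate, and round count are illustrative assumptions, not a procedure from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 200, 500
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:20] = rng.standard_normal(20)    # sparse ground-truth weights
y = X @ w_true                            # noiseless targets

def train(w0, mask, steps=500, lr=0.5):
    """Gradient descent on the masked weights only (pruned weights stay zero)."""
    w = w0 * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad * mask             # updates never leave the subnetwork
    return w

w0 = rng.standard_normal(d) * 0.1
mask = np.ones(d)
for _ in range(10):                       # iterative magnitude pruning (IMP)
    w = train(w0, mask)
    keep = int(mask.sum() * 0.8)          # prune 20% of surviving weights
    idx = np.argsort(np.abs(w))[::-1][:keep]
    mask = np.zeros(d)
    mask[idx] = 1
    # each round "rewinds" by retraining from the original init w0

w_final = train(w0, mask)
loss = float(np.mean((X @ w_final - y) ** 2))
```

After ten rounds the mask retains roughly 10% of the weights, yet the retrained subnetwork still fits the data, because magnitude pruning repeatedly discards coordinates that training drives toward zero.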

3. Theoretical Underpinnings

The explanation for MTH relies on overparameterization, geometry of the loss landscape, and in some applications, the structure of low-dimensional subspaces:

  • Sparse Subgraph Redundancy: In supervised learning, the existence of many disjoint subgraphs capable of reaching the same loss basin, provided a critical core of connections, leads to the observed multiple tickets (Vandersmissen et al., 2023).
  • Fisher Subspace in RLVR: For RLVR fine-tuning, the implicit KL-trust region restricts meaningful updates to a low-rank subspace ($r \ll d$) of the parameter space. Because the dominant eigenvectors of the Fisher information are delocalized, almost any random subnetwork of size $k > r$ will intersect this subspace, enabling effective optimization and yielding exceedingly many possible tickets (Adewuyi et al., 2 Feb 2026).
  • Universal Approximation and Redundancy in Mechanism Design: A large ReLU-MLP contains many tiny subnetworks (pruned versions) capable of representing near-optimal solutions for mechanism design tasks; systematic exclusion and rediscovery are achieved by leveraging worst-case profiles (Guo, 2023).
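The subspace-intersection argument behind the RLVR bullet can be checked numerically: restricting a delocalized $r$-dimensional basis to a random coordinate subset of size $k > r$ almost always preserves rank $r$, so a random ticket can move along every direction of the low-rank update subspace. A sketch with illustrative dimensions (not taken from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, k = 1000, 5, 50   # ambient dim, subspace rank, ticket size (k > r)

# A delocalized r-dimensional subspace: random Gaussian basis, orthonormalized.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

hits = 0
for _ in range(100):
    coords = rng.choice(d, size=k, replace=False)   # a random k-sparse ticket
    # If the basis restricted to the ticket's coordinates has rank r, the
    # masked subnetwork intersects all r directions of the update subspace.
    if np.linalg.matrix_rank(U[coords, :]) == r:
        hits += 1
print(hits)   # essentially every random ticket intersects the subspace
```

With delocalized eigenvectors the restriction behaves like a well-conditioned random matrix, so rank deficiency is a measure-zero event; localized eigenvectors would break this, which is why delocalization matters for the argument.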

4. Quantitative and Structural Insights

Empirical analyses across tasks and domains demonstrate key features:

  • Overlap Metrics: Typical (global) overlap for tickets discovered via global-magnitude, $L_1$, and $L_2$ criteria is ≈0.33%, while the unpruned-only overlap is ≈9.34% at 96.48% sparsity, though some layers (initial and final) exhibit higher overlap (60–70%) (Vandersmissen et al., 2023).
  • Core Connections: A small, stable backbone is shared by all tickets; these connections tend to have more stable signs and less variance in their training trajectories.
  • Random Mask Performance: At up to 99.95% sparsity, randomly masked subnetworks (as few as 1% of parameters) in RLVR achieve or slightly exceed dense-network performance, with mean Jaccard overlaps $\approx 0.005$, i.e., near-zero overlap among successful tickets (Adewuyi et al., 2 Feb 2026).
  • Ensemble Diversity: Winning tickets discovered from a single model can have high mask overlap yet still differ in their predictions, and increased diversity (measured by, e.g., disagreement $D$ or the Q-statistic) yields higher ensemble gains. Inducing more mask and prediction diversity through randomness or $L_1$ regularization increases the benefit further (Kobayashi et al., 2022).
  • Sensitive vs. Robust Discovery: Direct magnitude-based tickets can be sensitive to global sparsity targets, but robustified schemes such as power-propagation parameterizations broaden the range of tickets that remain performant and stable (Cheng et al., 2022).
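The diversity metrics above are easy to compute from per-ticket predictions. A minimal sketch of pairwise disagreement and the Q-statistic over joint correctness (the labels and predictions below are synthetic examples):

```python
import numpy as np

def disagreement(pred_a, pred_b):
    """Fraction of examples on which two classifiers predict differently."""
    return float(np.mean(pred_a != pred_b))

def q_statistic(pred_a, pred_b, y):
    """Yule's Q over joint correctness: +1 = errors coincide, 0 = independent,
    negative = the two classifiers tend to err on different examples."""
    a_ok, b_ok = pred_a == y, pred_b == y
    n11 = np.sum(a_ok & b_ok)      # both correct
    n00 = np.sum(~a_ok & ~b_ok)    # both wrong
    n10 = np.sum(a_ok & ~b_ok)     # only a correct
    n01 = np.sum(~a_ok & b_ok)     # only b correct
    return float((n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01))

y  = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # ground-truth labels
p1 = np.array([0, 1, 1, 0, 0, 0, 1, 1])   # ticket 1's predictions
p2 = np.array([0, 1, 0, 0, 1, 1, 1, 1])   # ticket 2's predictions
print(disagreement(p1, p2))               # 0.375
```

Here the two tickets never err on the same example ($Q = -1$), the kind of complementary-error structure that drives ensemble gains in the multi-ticket setting.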

5. Applications and Empirical Ranges

The MTH has been empirically validated in varying contexts:

| Domain | Ticket Selection Method | Notable Empirical Results |
|---|---|---|
| Supervised (image/tabular) | Importance-based pruning | Multiple tickets with ≈99% disjointness at 99% sparsity, near-identical accuracy (Vandersmissen et al., 2023) |
| RLVR with LLMs | Random masks | 1%-parameter tickets, minimal Jaccard overlap, full performance up to 99.95% sparsity (Adewuyi et al., 2 Feb 2026) |
| Mechanism design (VCG) | Sequential prune/scratch | Tiny successful subnetworks discovered by iterative pruning/redraw (Guo, 2023) |
| Ensembles from pretrained models | Diverse IMP masks | Multi-ticket ensembles outperform k-dense-model ensembles; diversity drives ensemble gains (Kobayashi et al., 2022) |
| Binary/sparse MPTs | Score-based binary pruning | Robustness to thinning; inference/training speedups without accuracy loss (Cheng et al., 2022) |

These empirical findings underline the breadth of the MTH, from standard vision models through LLM-based RL, to specialized economic mechanisms.

6. Practical and Algorithmic Implications

The Multiple Ticket Hypothesis informs practical methodologies:

  • Efficient Pruning: Layerwise and randomized pruning strategies can uncover multiple performant sparse models from a single initialization, suitable for parallel or ensemble deployment (Vandersmissen et al., 2023, Kobayashi et al., 2022).
  • Ensemble Construction: Ensembles of tickets, especially when designed to maximize disagreement, yield superadditive gains versus dense model ensembles, exploiting the diversity in the multiple ticket set (Kobayashi et al., 2022).
  • Efficient Inference and Training: Advanced ticket extraction methods (power-propagation, thresholding, structured pruning) achieve improvements in robustness to prune ratio, and actual compute benefits in training and inference (Cheng et al., 2022).
  • Algorithmic Search for Optimality: Systematic algorithms for ticket re-draw and “scratching” (exclusion of failed tickets) facilitate achieving near-theoretical optimality in domains like mechanism design, outperforming brute-force random initialization (Guo, 2023).
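As a concrete sketch of the ensemble-deployment idea, several sparse models can be derived from one set of dense weights via random masks and combined by soft voting (the linear model, mask density, and data here are toy assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_tickets = 100, 5
theta = rng.standard_normal(d)   # stand-in for one trained dense weight vector

# Derive n_tickets sparse members from the single dense model via random masks,
# each keeping ~20% of the weights.
masks = [(rng.random(d) < 0.2).astype(float) for _ in range(n_tickets)]
tickets = [theta * m for m in masks]

def ensemble_predict(x, tickets):
    """Soft voting: average the sparse members' scores, then threshold."""
    scores = np.array([float(t @ x) for t in tickets])
    return int(scores.mean() > 0.0)

x = rng.standard_normal(d)
label = ensemble_predict(x, tickets)
```

Because the members share storage with the dense model and differ only in their masks, such an ensemble costs far less memory than k independently trained dense networks.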

7. Open Questions and Future Directions

While the Multiple Ticket Hypothesis is robustly supported, several open directions remain:

  • Characterizing the Shared Core: The determinants, size, and stability of the "core" intersection among all performant tickets remain incompletely understood; evidence suggests it is essential for convergence but minimal in size (Vandersmissen et al., 2023).
  • Dimensionality Thresholds: In RLVR, the breakdown point of random ticket success corresponds to the intrinsic dimension of the effective Fisher space, but more explicit links between network, data, and mask structure are needed (Adewuyi et al., 2 Feb 2026).
  • Automated Ticket Selection: Joint optimization of pruning thresholds, regularizations, or ensembling procedures may further improve robustness and efficiency (Cheng et al., 2022).
  • Generalization Properties: Whether diversity among tickets correlates with tighter generalization bounds or robustness in out-of-distribution settings remains an active research direction.
  • Adaptation Beyond Current Domains: Extending MTH principles to other structured models, architectures, or optimization paradigms may expose further theoretical and applied power.

The Multiple Ticket Hypothesis recasts model sparsification not as a search for one privileged subnetwork, but as exploration in a vast structured family of functionally equivalent representations, catalyzing developments in resource-efficient deep learning, optimization, and computational mechanism design (Vandersmissen et al., 2023, Adewuyi et al., 2 Feb 2026, Guo, 2023, Kobayashi et al., 2022, Cheng et al., 2022).
