Data Selection via Learning Percentage

Updated 31 January 2026
  • The paper introduces a formal framework that selects a subset of data meeting a target learning percentage while minimizing cost using submodular optimization.
  • It demonstrates that greedy algorithms yield strong approximation guarantees for cost-efficient subset selection under monotonic and submodular conditions.
  • The method scales to large datasets and supports applications in LLM pretraining, curriculum learning, and active learning through reinforcement and uncertainty strategies.

Data selection via learning percentage refers to the formalized process of selecting a subset of data sources or samples, often under a budget constraint, such that the chosen subset guarantees a prescribed proportion of the achievable learning performance. Originating in Bayesian learning and extending to empirical risk minimization, curriculum learning, subset selection for large-scale pretraining, and active learning, the “learning percentage” is operationalized as a threshold ρ or as a fixed fraction of samples, with the aim of minimizing cost or label budget while achieving (or closely matching) the full-data predictive power. This concept has become central in the design and analysis of data-centric learning systems across theory and practice.

1. Formalization of the Data Selection Problem

The canonical problem is to select a subset $S \subseteq \Omega$ of sources or data points (with $|\Omega| = n$) that incurs minimum total cost while guaranteeing that a monotone "learning-gain" function $f(S)$ meets or exceeds a target threshold $\rho$ (the learning percentage). In mathematical terms (Ye et al., 2020):

$$\min_{S \subseteq \Omega} \sum_{i \in S} c_i \quad \text{subject to} \quad f(S) \geq \rho$$

where:

  • $c_i > 0$ denotes the cost associated with item/source $i$
  • $f$ is a nonnegative, normalized (i.e., $f(\varnothing) = 0$), monotone, submodular function quantifying the utility or performance
  • $\rho$ specifies the target fraction of full achievable performance

In Bayesian data source selection (Ye et al., 2020), $f(S) = z(S)$ is a normalized error-bound coverage, also interpretable as information gain, mutual information, or variance reduction. In supervised learning, $f$ may capture accuracy, explained variance, representation coverage, or diversity. Solutions adhere to cardinality constraints ($|S| \leq k$, with $p = k/n$ specifying the fraction) or value-oriented constraints ($f(S) \geq \rho$).
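The constrained minimization above can be made concrete on a toy instance. The following sketch brute-forces the minimum-cost subset satisfying $f(S) \geq \rho$ for a small coverage-style utility; the universe, coverage sets, and costs are illustrative assumptions, not data from the cited papers:

```python
from itertools import combinations

# Each source "covers" part of a 6-element universe; f(S) = fraction covered.
universe = set(range(6))
covers = {0: {0, 1}, 1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5}, 4: {0, 5}}
cost = {0: 1.0, 1: 2.0, 2: 1.5, 3: 1.0, 4: 1.2}

def f(S):
    """Normalized coverage: monotone, submodular, f(empty set) = 0."""
    covered = set().union(*(covers[i] for i in S)) if S else set()
    return len(covered & universe) / len(universe)

rho = 0.8  # demand 80% of the full achievable performance
feasible = (S for r in range(1, len(covers) + 1)
            for S in combinations(covers, r) if f(S) >= rho)
best = min(feasible, key=lambda S: sum(cost[i] for i in S))
print(best)  # (1, 3): covers 5 of 6 elements at total cost 3.0
```

Exhaustive search is only viable for tiny $n$; the greedy methods of Section 3 replace it at scale.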

2. Theoretical Properties and Algorithmic Hardness

Data selection via learning percentage is, in general, computationally intractable. For example, in the Bayesian setting the problem is shown to be NP-hard via a reduction from Set Cover—even when all costs are unitary (Ye et al., 2020). The key insight is that requiring the error on a particular hypothesis to reach zero is equivalent to covering all elements in a universe, which directly maps to classical set cover.

Despite this hardness, the objective admits a favorable structure: for a wide class of performance measures, including those derived from mutual information, variance reduction, or information gain, the set function ff is monotonic and submodular. This renders the problem amenable to algorithms with provable approximation guarantees.
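The diminishing-returns property underpinning these guarantees can be checked numerically. The sketch below verifies monotone submodularity of a concave-of-modular function $f(S) = \sqrt{\sum_{i \in S} w_i}$ on an assumed four-item ground set; this is one example from the class of utility functions described above, not a construction from the cited papers:

```python
from itertools import chain, combinations
from math import sqrt

w = [1.0, 2.0, 3.0, 4.0]                      # assumed item weights
f = lambda S: sqrt(sum(w[i] for i in S))      # concave of modular => monotone submodular

def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

ok = True
for T in subsets(range(4)):
    for S in subsets(T):                      # every S with S a subset of T
        for j in set(range(4)) - set(T):      # any item outside T
            gain_S = f(set(S) | {j}) - f(S)
            gain_T = f(set(T) | {j}) - f(T)
            ok = ok and gain_S >= gain_T - 1e-12   # diminishing returns
print(ok)  # True
```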

3. Submodular Set Cover and Greedy Approximations

When $f$ is monotone submodular, the data selection task becomes a submodular set cover problem. The canonical greedy algorithm iteratively selects the item with the highest marginal gain per unit cost, updating the subset until the learning threshold is met (Ye et al., 2020, Kaushal et al., 2019). Greedy methods are accompanied by strong guarantees: if $S^*$ is an optimal subset, $h(S) = \sum_{i \in S} c_i$ denotes total cost, and $M = \max_{j} f(\{j\})$, then the cost of the greedy solution satisfies

$$h(S_\mathrm{greedy}) \leq (1 + \ln M)\, h(S^*)$$

(Ye et al., 2020). Approximations for subset selection over coverage/dispersion functions (Facility-Location, Disparity-Min) yield $(1 - 1/e)$ and $1/2$ multiplicative factors for the achievable learning value (Kaushal et al., 2019). Fast threshold-based variants accelerate runtime with minimal degradation in the guarantee: a $1/(1-\epsilon)$ multiplicative factor, with query complexity scaling as $O((n/\epsilon)\ln(n/\epsilon))$ (Ye et al., 2020).
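A minimal sketch of the cost-effective greedy rule follows, run on an assumed toy coverage instance (the universe, coverage sets, and costs are hypothetical, not from the cited papers):

```python
def greedy_set_cover(items, f, cost, rho):
    """Add the item with largest marginal gain per unit cost until f(S) >= rho."""
    S = set()
    while f(S) < rho:
        gains = {j: f(S | {j}) - f(S) for j in items if j not in S}
        best = max(gains, key=lambda j: gains[j] / cost[j], default=None)
        if best is None or gains[best] <= 0:
            raise ValueError("threshold rho is not achievable")
        S.add(best)
    return S

# Toy instance: normalized coverage of a 6-element universe.
universe = set(range(6))
covers = {"a": {0, 1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {0, 5}}
cost = {"a": 1.0, "b": 1.0, "c": 1.5, "d": 1.0}
f = lambda S: len(set().union(*(covers[j] for j in S)) if S else set()) / len(universe)

chosen = greedy_set_cover(covers, f, cost, rho=1.0)
print(chosen)  # {'a', 'c'} at total cost 2.5
```

Each iteration costs one pass over the remaining items, so the naive loop runs in $O(n^2)$ oracle calls, matching the complexity quoted in the table below; lazy evaluation or thresholding reduces this further.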

4. Learning Percentage in Modern Data- and Label-Efficient Training

The learning percentage framework directly informs practical selection in massive data settings. In large-scale LLM pretraining (Fan et al., 30 Dec 2025), policy-gradient-based mask learning enforces a strict selection fraction (e.g., exactly 10% of 15T tokens in FineWeb) while optimizing a joint quality-diversity objective:

$$R(\theta) = \mathbb{E}_{M \sim p_\theta}\left[\alpha Q(M) + \beta D(M)\right]$$

Sampling is constrained to $\sum_{i=1}^N M_i = S$ (the fraction learned is $S/N$), and policy gradients ensure parameter updates favor high expected utility subject to the selection percentage. Hard constraints on subset size avoid the need for penalty terms or soft targets. With this approach, a 10% ("learning percentage") subset suffices to improve downstream LLM performance over quality-only or diversity-only baselines (Fan et al., 30 Dec 2025).
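A heavily simplified REINFORCE-style sketch of budget-constrained mask learning is given below. The quality and diversity proxies, the with-replacement surrogate for the log-probability gradient, and all hyperparameters are assumptions for illustration; the actual DATAMASK estimator is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 20, 2                      # select exactly S of N items (10% learning percentage)
quality = rng.random(N)           # assumed per-item quality scores
topic = rng.integers(0, 5, N)     # crude topic ids, used as a diversity proxy
theta = np.zeros(N)               # selection-policy logits
alpha, beta, lr, baseline = 1.0, 1.0, 0.5, 0.0

def sample_subset(theta):
    """Sample exactly S distinct items with probability proportional to softmax(theta)."""
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return rng.choice(N, size=S, replace=False, p=p), p

for _ in range(200):
    idx, p = sample_subset(theta)
    R = alpha * quality[idx].mean() + beta * len(set(topic[idx])) / S  # reward aQ + bD
    baseline = 0.9 * baseline + 0.1 * R        # moving-average variance-reduction baseline
    score = -S * p                             # approximate grad of log p(M|theta) ...
    score[idx] += 1.0                          # ... (with-replacement surrogate)
    theta += lr * (R - baseline) * score       # REINFORCE ascent step
```

The hard budget is enforced structurally by sampling exactly $S$ indices without replacement, so no penalty term on subset size is needed.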

In curriculum and active learning, adaptive policies (reinforcement learning (Fan et al., 2017), curriculum progression based on learner ability (Lalor et al., 2020), or batch-level subset selection (Kaushal et al., 2019)) govern the fraction of data presented at each stage. Stagewise or epoch-based methods dynamically expand or contract the percentage of included data, often as a function of an estimated learning curve, reward, or model confidence.

5. Performance Bounds and Statistical Guarantees

Learning-theoretic analyses formalize how small a learning percentage suffices for various ERMs. For mean estimation in $\mathbb{R}$, the worst-case relative risk after optimal $n$-point selection is at most $1 + 1/(2n-1)$, and for linear classification in $\mathbb{R}^d$, the error is zero as soon as $n > d$ (Hanneke et al., 20 Apr 2025). In strictly convex stochastic optimization, $n > d$ points likewise suffice to match the population loss.

In high-dimensional or weakly supervised regimes, optimal sampling probabilities $\pi(x)$ and the associated learning performance as a function of $\gamma = n/N$ (the "learning percentage") are derived. Soft subsampling with tuned exponents (e.g., $\pi(x) \propto [\text{uncertainty}(x)]^\alpha$ with $\alpha \approx 0.5$) achieves within 1–2% of full-data error for $\gamma \approx 0.4$–$0.6$, even with weak surrogate models (Kolossov et al., 2023). Non-reweighted deterministic selection can outperform reweighted unbiased schemes, especially as $\gamma$ decreases toward the practical regime.
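A sketch of soft subsampling with a tunable exponent, thinned so the expected selected fraction equals $\gamma$; the uncertainty scores here are synthetic stand-ins for a surrogate model's outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, gamma, alpha = 10_000, 0.5, 0.5
u = rng.random(N)                # surrogate uncertainty scores (synthetic here)
pi = u ** alpha                  # soft emphasis on uncertain points
pi *= gamma * N / pi.sum()       # rescale so the expected selected count is gamma * N
pi = np.clip(pi, 0.0, 1.0)       # guard against probabilities above 1
keep = rng.random(N) < pi        # independent Bernoulli thinning
print(keep.mean())               # close to gamma = 0.5
```

Setting $\alpha = 0$ recovers uniform subsampling, while large $\alpha$ approaches hard uncertainty thresholding; the 1–2% gap quoted above corresponds to the intermediate $\alpha \approx 0.5$ regime.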

Typical empirical results reveal that for diverse computer vision and NLP tasks, 10–40% of the original data (selected by submodular, uncertainty, or RL-based policies) achieves 95–100% of full-data performance (Kaushal et al., 2019, Chandhok et al., 1 Jun 2025, Fan et al., 30 Dec 2025). In some cases, carefully chosen fractions even improve out-of-sample accuracy over the full set, producing a negative gap.

6. Practical Algorithms and Selection Criteria

Algorithmic instantiations span a spectrum:

  • Greedy and accelerated greedy set function maximization (Facility-Location, Dispersion, coverage functions) (Ye et al., 2020, Kaushal et al., 2019)
  • Policy-gradient mask learning with hard subset budget constraints (Fan et al., 30 Dec 2025)
  • Soft or thresholded uncertainty-based selection with tunable exponents, sometimes guided by influence function or curvature calculations (Kolossov et al., 2023)
  • Reinforcement learning policies that optimize for rapid convergence or data efficiency (Fan et al., 2017)
  • Dynamic data inclusion based on learner ability relative to difficulty, thresholded at each stage (Lalor et al., 2020)

Practical implementation includes pilot runs to estimate accuracy-vs-percentage curves, tuning of the balance between diversity and representation, and interpolation between random, sample-hardness-driven, and submodular objectives (Kaushal et al., 2019, Chandhok et al., 1 Jun 2025).

7. Empirical Insights and Recommendations

  • For multi-class and high-redundancy datasets, representation-oriented selection (Facility-Location) is superior; for binary, highly redundant data, diversity (Dispersion) is preferable (Kaushal et al., 2019)
  • Validating the learning percentage via accuracy curves or cross-validation identifies the plateau point (knee $p^*$) beyond which additional percentage yields diminishing accuracy returns
  • In extremely large-scale or resource-constrained scenarios, pre-filtering (e.g., via quality scores) before joint selection further accelerates subset learning (Fan et al., 30 Dec 2025)
  • Non-reweighted, thresholded selection strategies are robust, and the optimal fraction is problem-dependent but often low (10–50%) (Kolossov et al., 2023)
  • In true weak supervision or surrogate-limited scenarios, even marginally informative surrogates suffice for near-optimal data selection (Kolossov et al., 2023)
  • Data selection via learning percentage unifies set cover, sample compression, coresets, and curriculum learning in a data-centric learning theory (Hanneke et al., 20 Apr 2025)
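The plateau-point recommendation above can be operationalized with a simple knee-detection heuristic: record an accuracy-vs-percentage curve from pilot runs and take the point of maximum distance from the chord joining its endpoints. The curve below is a synthetic saturating exponential, used purely for illustration:

```python
import numpy as np

p = np.linspace(0.05, 1.0, 20)               # candidate learning percentages
acc = 0.95 * (1 - np.exp(-6 * p)) + 0.02     # assumed pilot-run accuracies

# Perpendicular distance of each point from the line through the endpoints.
x0, y0, x1, y1 = p[0], acc[0], p[-1], acc[-1]
dist = np.abs((y1 - y0) * p - (x1 - x0) * acc + x1 * y0 - y1 * x0)
dist /= np.hypot(y1 - y0, x1 - x0)
knee = p[np.argmax(dist)]                    # knee p*: start of diminishing returns
print(round(knee, 2))  # 0.35
```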

Table: Methods and Guarantees for Data Selection via Learning Percentage

Paper / Framework | Selection Mechanism | Typical Guarantee / Regime
(Ye et al., 2020) (Bayesian SC) | Submodular greedy / fast greedy | Cost ≤ (1 + ln M)·Opt in O(n²); fast variant O((n/ε)ln(n/ε))
(Fan et al., 30 Dec 2025) (DATAMASK) | Policy-gradient mask, hard budget | ≈10% of tokens; +3.2%/1.9% vs. baselines (LLMs)
(Kaushal et al., 2019) (Submodular subset, active) | Facility-Location, Dispersion, greedy | 95–100% accuracy at 10–50% subset; (1 − 1/e) or 1/2 approx. factor
(Kolossov et al., 2023) (Weak supervision) | Uncertainty/influence-thresholded | <2% error gap at γ ≈ 0.4–0.6; optimal π(x) yields minimum excess risk
(Lalor et al., 2020) (Curriculum) | IRT-based ability threshold | Dynamic %S grows/shrinks per epoch; fastest convergence
(Fan et al., 2017) (RL/NDF) | RL-learned filtering | Up to 50% instance reduction, matches full-data accuracy

Data selection via learning percentage thus provides a principled, theory-grounded, and empirically validated methodology for cost-efficient, high-performance model training across a diverse range of modern ML settings. The dominant open questions concern further tightening of selection bounds in various regimes, development of practical near-linear-time algorithms, and extensions to non-Euclidean, nonconvex, or pretrained representation scenarios.
