
Leaner-Pretrain Dataset Overview

Updated 31 January 2026
  • Leaner-pretrain datasets are deliberately reduced, noise-minimized, and optimally curated subsets that maintain core data distributions for efficient model training.
  • They are constructed using methods like conditional filtering, curriculum-based pruning, and Shapley-value analysis to selectively retain influential data points.
  • These datasets achieve significant compute reductions while sustaining or enhancing model convergence and downstream performance across various domains.

A leaner-pretrain dataset is a deliberately reduced, noise-minimized, and optimally curated subset of a much larger raw pretraining corpus. The principal objective is to maximize training efficiency and generalization for neural models—whether vision, language, or multimodal—while significantly reducing computational overhead. These datasets are constructed through filtering, data minimization, or automated selection techniques that preserve essential distributional, domain, or functional properties required for downstream tasks. Approaches include conditional data selection based on target tasks, curriculum-based difficulty pruning, Shapley-value analysis for data attribution, and various signal-preserving denoising or simplification procedures.

1. Rationale and Definition

Leaner-pretrain datasets address the prohibitive resource cost and inherent redundancy of conventional large-scale pretraining corpora. By judiciously selecting or refining a fraction of the full data—typically 6–25% for vision datasets (Chakraborty et al., 2020, Wang et al., 2023), ~10% for instruction fine-tuning (He et al., 2024), or by aggressive simplification to one-third the original size for linguistic datasets (Yang et al., 2024)—it is possible to sustain or even improve model convergence rates, generalization, and downstream efficacy. The essential principles are noise reduction, preservation of domain or task-relevant distributions, and mitigation of misalignment or informational collapse.

2. Construction Methodologies

A. Conditional Filtering (Task-Aware Data Selection)

Conditional data filtering tailors the subset selection to be maximally relevant for a specific target dataset or task. Two principal modes are established (Chakraborty et al., 2020):

  • Clustering-based: Embed the target data, cluster it, and select source samples closest to cluster centroids by some aggregation metric (mean/min of distances).
  • Domain Classifier-based: Train a binary classifier to distinguish target from source samples and rank all source samples by target-likeness.

Both methods enable aggressive downsampling (e.g., to 6% or 12% of ImageNet) with downstream accuracy degradation of ≤1–4% even at the tightest budgets. The domain classifier method shows greater stability across tasks and settings.
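The clustering-based mode can be sketched in a few lines. The following is a minimal NumPy illustration with hypothetical function names, using a toy k-means and mean-distance aggregation in place of the papers' embedding models and tuned metrics:

```python
import numpy as np

def kmeans_centroids(X, k, iters=20, seed=0):
    """Plain k-means, returning only the centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def select_source_subset(target_emb, source_emb, budget, k=4):
    """Keep the `budget` source samples whose mean distance to the
    target cluster centroids is smallest (mean-aggregation variant)."""
    centers = kmeans_centroids(target_emb, k)
    dists = np.linalg.norm(
        source_emb[:, None, :] - centers[None, :, :], axis=-1
    ).mean(axis=1)
    return np.argsort(dists)[:budget]
```

The min-aggregation variant simply replaces `.mean(axis=1)` with `.min(axis=1)`; the domain-classifier mode would instead rank source samples by a classifier's target-likeness score.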

B. Curriculum-Based Loss-Guided Pruning

The Scale Efficient Training (SeTa) pipeline (Zhou et al., 17 Mar 2025) reduces data through:

  1. Random pruning to eliminate duplicates/redundancies.
  2. 1-D k-means clustering on per-sample loss to group samples by "difficulty".
  3. Sliding window schedules (easy-to-hard or vice versa) select dynamically evolving active subsets.
  4. Partial annealing (random sampling) in final epochs to mitigate curriculum bias.

SeTa supports 30–50% pruning on both synthetic and natural data—text and vision—incurring less than 1% loss in accuracy or even slight improvements.
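Steps 2–3 above can be sketched as follows; this toy Python uses equal-size loss-sorted buckets as a stand-in for 1-D k-means and slides an easy-to-hard window across epochs (names and defaults are illustrative, not SeTa's exact settings):

```python
import numpy as np

def seta_active_subset(losses, epoch, n_epochs, n_groups=8, window_frac=0.5):
    """Return indices of the active training subset for this epoch.

    Samples are sorted by per-sample loss (easy -> hard) and split into
    difficulty groups; a window covering `window_frac` of the groups
    slides from the easiest group at epoch 0 to the hardest by the end.
    SeTa additionally anneals with random sampling in the final epochs.
    """
    order = np.argsort(losses)                  # low-loss samples first
    groups = np.array_split(order, n_groups)
    win = max(1, int(round(window_frac * n_groups)))
    start = int(round((n_groups - win) * epoch / max(1, n_epochs - 1)))
    return np.concatenate(groups[start:start + win])
```

A hard-to-easy schedule is the same computation with the window start counted down instead of up.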

C. Shapley-Value Subset Selection

SHED (He et al., 2024) formalizes per-example data value as the Shapley value of its marginal contribution to a downstream metric (e.g., accuracy after inclusion in a fine-tuning set). To make Shapley computation tractable, SHED:

  • Clusters embedding representations to select proxies.
  • Estimates Shapley via Monte Carlo block-removal experiments within the proxy set.
  • Ranks clusters for quality-aware sampling (ordered or weighted).

Pragmatically, datasets as small as 10% of the original—selected by SHED—consistently match or surpass full-dataset performance across LLM architectures and transfer to other models.
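SHED's proxy-based estimator is more involved, but its core idea, averaging marginal contributions over sampled orderings, can be illustrated with a generic permutation-sampling Shapley estimator; `utility` stands in for the downstream metric, and all names here are hypothetical:

```python
import random

def mc_shapley(items, utility, n_perm=200, seed=0):
    """Estimate Shapley values by averaging each item's marginal gain
    over randomly ordered coalitions (Monte Carlo permutation sampling)."""
    rng = random.Random(seed)
    phi = dict.fromkeys(items, 0.0)
    for _ in range(n_perm):
        perm = list(items)
        rng.shuffle(perm)
        coalition, prev = set(), utility(set())
        for i in perm:
            coalition.add(i)
            cur = utility(coalition)       # marginal gain of adding item i
            phi[i] += cur - prev
            prev = cur
    return {i: v / n_perm for i, v in phi.items()}
```

In SHED, `items` would be cluster proxies rather than individual examples and `utility` a cheap fine-tuning evaluation; the resulting per-cluster scores then drive quality-aware (ordered or weighted) sampling.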

D. Denoising and Complexity Reduction

The Leaner-Pretrain corpus (Yang et al., 2024) exemplifies direct noise and complexity minimization:

  • Samples from canonical pretraining domains (web, books, wiki, code, math).
  • Enforces a uniform, minimal vocabulary capped at 2,000 tokens via LLM-based rephrasing.
  • Applies simplification prompts to remove formatting/semantic noise, limit entity diversity, and simplify context details.
  • Preserves n-gram entropy and coarse domain ratios, ensuring that distributional alignment to conventional mixes is maintained.

Analogous approaches in vision-language (TL;DR (Wang et al., 2023)) leverage codebook-based clustering on image features, accompanied by automated sample selection and caption refinement to reduce VLP datasets by up to 84%.

E. Synthetic Lean Coresets for Specialized Domains

The Lean Workbook (for Lean 4 theorem proving) (Ying et al., 2024) demonstrates bootstrapping from formal–informal math pairs, large-scale synthetic translation, automatic filtering via REPL compilation and back-translation NLI, human-in-the-loop corrections, and multi-task fine-tuning. The resulting dataset, Lean Workbook, achieves 93.5% translation accuracy and represents a lean, high-utility pretraining set for formal reasoning agents.
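To make the target format concrete, each Workbook entry pairs an informal statement with a formal one that compiles under the REPL check. A minimal illustrative Lean 4 pair (invented for illustration, not drawn from the dataset) might look like:

```lean
-- Informal: "For natural numbers a and b, a + b equals b + a."
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```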

3. Empirical Performance and Quantitative Trade-offs

Leaner-pretrain strategies are empirically validated by detailed ablation studies and benchmark assessments:

| Dataset/Domain | Pruning Ratio | Main Result | Source |
|---|---|---|---|
| ImageNet (vision) | 6–12% | ≤1–4% drop in accuracy, 70–90% compute cut | (Chakraborty et al., 2020) |
| BLIP on CC3M (VLP) | 25% | +1 to +6% in retrieval/task metrics, 75% of data eliminated | (Wang et al., 2023) |
| MMLU/WizardLM LLM fine-tune | 10% | Matches (±0.7%) or surpasses full-data accuracy, robust across architectures | (He et al., 2024) |
| Leaner-Pretrain LM | 71% (vs. 100M) | LM test loss 0.90 (vs. 1.10), +12–40 points GLUE, +0.3–0.4 instruction eval | (Yang et al., 2024) |
| SeTa: LLaMA-7B instruct | 26% | +1 pt HumanEval vs. baseline, minimal loss on MMLU | (Zhou et al., 17 Mar 2025) |

Table cells are kept concise for comparison; further details on evaluation, architectures, and fine-tuning hyperparameters appear in the cited works.

4. Distributional Alignment and Statistical Properties

Leaner-pretrain datasets are explicitly constructed to maintain domain or genre proportions and statistical primitives. In Leaner-Pretrain (Yang et al., 2024), a reduction in n-gram entropy is observed (H_1: 16.41 → 15.86; H_2: 22.76 → 21.16; H_3: 24.51 → 23.25), yet support and alignment relative to standard training mixes are preserved. Empirical complexity bounds formalize how simplifying grammar and capping vocabulary lower combinatorial dataset complexity (Appendix A, (Yang et al., 2024)).
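The entropy figures above are Shannon entropies of empirical n-gram distributions; computing them for a tokenized corpus is straightforward (a minimal sketch, not the paper's exact tokenization or corpus):

```python
import math
from collections import Counter

def ngram_entropy(tokens, n):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A drop in H_1 through H_3 after simplification, as reported above, indicates a more predictable token stream under the capped vocabulary while domain ratios stay fixed.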

In VLP domains, codebook-based embedding clustering preserves image-text alignment after refinement, while the TL;DR approach (Wang et al., 2023) demonstrates that careful selection/refinement can nearly eliminate low alignment-scoring pairs.

5. Practical Guidelines and Limitations

Key best practices for building and using leaner-pretrain datasets include:

  • Conservative pruning for initial studies (r ≈ 0.3–0.5).
  • Selecting K = 8–16 clusters for k-means, with window scale α ≈ 0.5 in SeTa (Zhou et al., 17 Mar 2025).
  • For domain classifier conditional filtering, monitoring classifier accuracy (92–95% optimal) is crucial (Chakraborty et al., 2020).
  • Ensuring the final corpus preserves coverage of core domain-subdomain distributions; over-aggressive or naive random pruning can induce catastrophic forgetting or domain drift.
  • Annealing with random sampling in the final epochs is recommended to correct curriculum-induced selection bias (Zhou et al., 17 Mar 2025).
  • Automated procedures (e.g., SHED) reduce the need for human-in-the-loop, but rare or edge cases may still be under-represented.

Limitations are context-dependent: potential under-selection in rare categories (Shapley-based), high up-front overhead (some conditional filtering), or difficulty scaling to web-scale without clustering/proxy shortcuts. Extreme domain shift reduces the efficacy of filtering methods (Chakraborty et al., 2020).

6. Applications and Broader Impact

Leaner-pretrain datasets have been realized in supervised and unsupervised image pretraining, vision-language retrieval, instruction fine-tuning for LLMs, formal theorem proving, and multi-task agents. Across modalities, these datasets lower the barrier to experimentation and pretraining for resource-limited research groups and prompt systematic study of curriculum and data quality. Their statistical and empirical alignment with large pretraining corpora supports not only efficiency but also nuanced investigation of model scaling, objective selection, and transfer phenomena (Chakraborty et al., 2020, Wang et al., 2023, He et al., 2024, Yang et al., 2024, Ying et al., 2024, Zhou et al., 17 Mar 2025).
