Data Selection with Importance Resampling (DSIR)
- DSIR is a principled framework that projects raw text into a feature space and uses density modeling to compute importance weights for curating training subsets.
- It quantitatively evaluates selection quality with KL-reduction metrics, demonstrating empirical gains in both domain-specific and general language modeling tasks.
- The approach leverages hashed n-gram extraction and multinomial models to efficiently sample from billions of documents, thereby optimizing pretraining data alignment.
Data Selection with Importance Resampling (DSIR) is a scalable and principled framework for curating large-scale pretraining corpora such that the resulting training data more closely aligns—under a specified feature representation—with a desired target distribution. Originally developed to address the impracticality of direct high-dimensional importance sampling for raw text, DSIR leverages feature-space density modeling and sampling-based subset selection to improve transfer to both domain-specific and general-domain language modeling tasks. The method underpins new metrics for evaluating the quality of data selection, delivers empirical gains on standard benchmarks, and connects the modern data-centric approach for pretraining with fundamental principles in statistical learning theory (Xie et al., 2023, Gu et al., 7 Jan 2025, Vogel et al., 2020).
1. Formalization and Theoretical Foundation
DSIR addresses the problem of selecting a representative subset from a large unlabeled raw dataset $D_{\text{raw}} = \{x_1, \ldots, x_N\}$, drawn from an unknown distribution $q$, so as to match a smaller "target" set drawn from an unknown distribution $p$ over the data space. Direct computation of the importance weights $p(x)/q(x)$ is infeasible in high-dimensional discrete text spaces. DSIR circumvents this by projecting both raw and target data to a lower-dimensional feature space $\mathcal{Z}$ using a mapping $h : \mathcal{X} \to \mathcal{Z}$, typically via hashed n-gram counts.
The core steps are:
- Featurize each document $x_i$ as $z_i = h(x_i)$, yielding count vectors in $\mathbb{Z}_{\geq 0}^{m}$ (where $m$ is the number of hash buckets).
- Fit bag-of-features generative models—multinomials parameterized by $\gamma \in \Delta^{m}$ (target) and $\lambda \in \Delta^{m}$ (raw)—to the target and raw distributions in $\mathcal{Z}$:
$$\hat{p}_{\text{feat}}(z) \propto \prod_{j=1}^{m} \gamma_j^{z_j}, \qquad \hat{q}_{\text{feat}}(z) \propto \prod_{j=1}^{m} \lambda_j^{z_j}.$$
- Define the importance weight for sample $x_i$ with features $z_i = h(x_i)$:
$$w_i = \frac{\hat{p}_{\text{feat}}(z_i)}{\hat{q}_{\text{feat}}(z_i)} = \prod_{j=1}^{m} \left(\frac{\gamma_j}{\lambda_j}\right)^{z_{ij}}.$$
The parameters $\gamma$ and $\lambda$ are estimated by normalized feature counts over the respective datasets (Xie et al., 2023, Gu et al., 7 Jan 2025).
Sampling $k$ points from $D_{\text{raw}}$ without replacement, with probabilities proportional to $w_i$, produces a subset whose feature distribution approximates that of the target, in the sense of importance resampling.
2. Practical Algorithm and Feature Construction
DSIR typically utilizes both unigrams and bigrams hashed into vectors of dimension $m$ (empirically $m = 10{,}000$ is effective, with diminishing returns above this scale). Each n-gram $g$ from a document yields a bucket index $j = \mathrm{hash}(g) \bmod m$ via a fast noncryptographic hash function, so the feature vector counts occurrences across the $m$ hash buckets.
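A minimal featurizer in this style might look as follows; the whitespace tokenization and the particular hash (CRC32 here) are illustrative choices, not necessarily the paper's exact implementation:

```python
import zlib
from collections import Counter

M = 10_000  # number of hash buckets (the empirically effective scale)

def hashed_ngram_features(text: str, m: int = M) -> Counter:
    """Count unigrams and bigrams of a whitespace-tokenized document,
    hashed into m buckets with a fast noncryptographic hash (CRC32)."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter()
    for g in ngrams:
        counts[zlib.crc32(g.encode("utf-8")) % m] += 1
    return counts
```

Because the output is a sparse `Counter`, downstream weight computation only needs to touch the buckets actually present in each document.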
The stepwise procedure for DSIR is:
- Compute hashed n-gram features for all target and raw documents.
- Estimate the multinomial parameters $\gamma$, $\lambda$ for the target/raw datasets.
- For every raw document $x_i$, compute the importance weight $w_i$ using the formula above.
- Normalize the weights: $\tilde{w}_i = w_i / \sum_{j=1}^{N} w_j$.
- Sample $k$ distinct indices without replacement using the categorical distribution defined by $\tilde{w}$ (commonly implemented via the Gumbel–Top-$k$ trick).
- Output the selected subset.
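The stepwise procedure above can be sketched in NumPy; the function names, the smoothing constant, and the dense count matrices are illustrative simplifications (a production implementation would use sparse counts):

```python
import numpy as np

def estimate_multinomial(count_matrix, alpha=1e-8):
    """Normalized (lightly smoothed) bucket frequencies -> multinomial params."""
    totals = count_matrix.sum(axis=0) + alpha        # aggregate counts per bucket
    return totals / totals.sum()

def dsir_select(raw_counts, target_counts, k, rng=None):
    """Select k row indices of raw_counts by importance resampling.

    raw_counts, target_counts: (n_docs, m) arrays of hashed n-gram counts.
    Returns the indices of the selected subset (without replacement).
    """
    rng = np.random.default_rng(rng)
    gamma = estimate_multinomial(target_counts)      # target parameters
    lam = estimate_multinomial(raw_counts)           # raw parameters
    log_ratio = np.log(gamma) - np.log(lam)          # computed once, reused
    log_w = raw_counts @ log_ratio                   # log importance weights
    # Gumbel-Top-k: adding i.i.d. Gumbel noise to log-weights and taking the
    # k largest samples from the categorical distribution without replacement.
    gumbel = rng.gumbel(size=log_w.shape)
    return np.argsort(log_w + gumbel)[-k:]
```

Working in log space keeps the weights stable even when documents contain hundreds of n-grams, and the Gumbel trick avoids ever materializing the normalized probabilities.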
This process is inherently scalable. Empirical results show that subsets of 100M documents can be selected from a billion-scale raw pool in approximately 4.5 hours on a 96-core CPU node, with feature extraction and weight computation dominating wall-clock runtime (Xie et al., 2023).
3. Evaluation Metrics: The KL-Reduction Criterion
To quantitatively measure how well selected pretraining data matches the target distribution in the feature space, DSIR introduces the KL-reduction metric. For empirical feature distributions $\hat{p}_{\text{feat}}$ (target), $\hat{q}_{\text{feat}}$ (raw), and $\hat{p}_{\text{sel}}$ (selected subset) over $\mathcal{Z}$, the KL-reduction averaged over a set of targets $T$ is:
$$\text{KL-reduction} = \frac{1}{|T|} \sum_{\hat{p}_{\text{feat}} \in T} \left[ \mathrm{KL}\!\left(\hat{p}_{\text{feat}} \,\middle\|\, \hat{q}_{\text{feat}}\right) - \mathrm{KL}\!\left(\hat{p}_{\text{feat}} \,\middle\|\, \hat{p}_{\text{sel}}\right) \right].$$
High KL-reduction indicates that the selected subset achieves a significant reduction in feature-space divergence relative to the raw pool, predicting stronger alignment with the target and, empirically, better downstream performance. In extensive experiments, the KL-reduction on hashed n-gram features correlates strongly with downstream F1 across multiple methods (Xie et al., 2023).
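The metric can be sketched directly from its definition, assuming the three distributions are supplied as (possibly unnormalized) bucket-frequency vectors:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions over hash buckets."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_reduction(target_feats, raw_feats, selected_feats):
    """KL(target || raw) - KL(target || selected): positive when selection
    moves the feature distribution of the subset toward the target."""
    return kl(target_feats, raw_feats) - kl(target_feats, selected_feats)
```

A subset identical to the raw pool scores zero; a subset whose feature histogram is closer to the target than the raw pool's scores positive.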
4. Empirical Results and Impact
Extensive validation demonstrates DSIR's impact on both domain-adaptive pretraining and general-domain language modeling. Key findings include (Xie et al., 2023, Gu et al., 7 Jan 2025):
- Domain-Specific Pretraining: On eight domain-shifted tasks with 25M selected examples, DSIR yields a +1.2 percentage point F1 increase over random selection and outperforms both heuristic classifiers (+0.9 pp) and often expert-curated selections (+0.3 pp). Within-domain transfer is optimal; selecting with a mismatched target distribution reduces F1 by roughly 6 pp.
- General-Domain Pretraining: When the target is Wikipedia and Books (GLUE benchmark), DSIR surpasses random selection and GPT-3 heuristic filtering by 2–2.5% in average GLUE scores.
- Comparative Baselines: Random selection and heuristic filtering underperform relative to DSIR-selected subsets; "top-$k$" variants, which deterministically keep the $k$ highest-weight examples, perform similarly but are consistently less effective than full resampling.
- Feature-Space Metrics: DSIR-selected data consistently achieves the largest KL-reduction and the best downstream F1 under hashed n-gram evaluation.
These results confirm the value of importance-resampled data curation for large-scale pretraining. DSIR's modular feature space and resampling protocol unlock state-of-the-art efficiency and performance for both general and domain-specific LMs.
5. Extensions: Feature Design, Trade-Offs, and Hybrid Methods
The selection of the feature space is central to DSIR's performance. The standard approach employs hashed n-gram statistics due to their tractability and strong alignment with token-level objectives (e.g., masked language modeling). Alternative approaches utilizing sentence-level neural embeddings (e.g., 384-dim Sentence-Transformer GMMs) are possible and can be integrated in hybrid importance-resampling schemes.
Empirical observations on feature choices (Gu et al., 7 Jan 2025):
- N-gram based DSIR is preferred for token-prediction pretraining objectives, offering strong gains on nearly all GLUE tasks.
- Embedding-based variants (as in Hybrid Importance Resampling, or "HIR") yield gains in tasks requiring global sentence-level semantics (notably STS-B), but entail higher computational cost and can suffer from imprecise density estimates in high-dimensional embedding spaces.
- A hybrid importance weight that interpolates between the n-gram and neural feature spaces—for example, a log-space mixture $\log w(x) = \alpha \log w_{\text{ng}}(x) + (1 - \alpha) \log w_{\text{emb}}(x)$—combines both signals. Exhaustive tuning of the interpolation parameter $\alpha$ is an open area for optimization.
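Such an interpolation is straightforward once both sets of log-weights are available; the geometric (log-space) mixture below is one natural instantiation, and the exact form used by HIR may differ:

```python
import numpy as np

def hybrid_log_weight(log_w_ngram, log_w_embed, alpha=0.5):
    """Geometric interpolation of n-gram and embedding importance weights:
    w = w_ng**alpha * w_emb**(1 - alpha), computed in log space for stability.
    alpha=1 recovers pure n-gram DSIR; alpha=0 recovers the embedding variant."""
    return alpha * np.asarray(log_w_ngram) + (1 - alpha) * np.asarray(log_w_embed)
```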
6. Broader Connections to Importance Sampling and Statistical Learning
DSIR is conceptually linked to importance sampling and weighted empirical risk minimization (ERM) used in statistical learning. Weighted ERM corrects for distribution shift by reweighting loss contributions via the likelihood ratio $w(x) = p(x)/q(x)$. DSIR operationalizes this paradigm for raw text by using efficient density estimation in a reduced feature space.
In standard settings (label shift, stratum shift, covariate shift), closed-form weights are known, and rigorous bounds guarantee that using either true or well-estimated weights preserves learning rates, modulo mild dependence on estimation error (Vogel et al., 2020). When ground truth distributional ratios are inaccessible (as with open-domain text), feature-space modeling provides a pragmatic approximation, and single-pass resampling suffices for unbiased risk estimation under the proxy distribution.
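The weighted-ERM correction can be illustrated with a small self-normalized estimator; self-normalization is a common stabilization choice, not necessarily the exact estimator analyzed in the cited work:

```python
import numpy as np

def weighted_risk(losses, log_weights):
    """Importance-weighted empirical risk: each example's loss is scaled by
    an estimate of w(x) = p(x)/q(x), so that samples drawn under q estimate
    the risk under the target distribution p. Self-normalized for stability."""
    log_weights = np.asarray(log_weights, dtype=float)
    w = np.exp(log_weights - np.max(log_weights))  # stable exponentiation
    w = w / w.sum()                                # self-normalized weights
    return float(np.dot(w, losses))
```

With uniform weights this reduces to the ordinary empirical mean; skewed weights shift the estimate toward the losses of target-like examples.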
Table: Comparison of Weighting Choices in DSIR and Related Approaches
| Bias Scenario | Weight Formula | Notes |
|---|---|---|
| Label Shift | $w(x, y) = p(y)/q(y)$ | Class priors known/estimated |
| Stratum Shift | $w(x) = p(s(x))/q(s(x))$ | Requires stratum (domain/category) metadata |
| Covariate Shift | $w(x) = p(x)/q(x)$ | Density-ratio estimation necessary |
| DSIR (LM) | $w(x) = \hat{p}_{\text{feat}}(h(x)) / \hat{q}_{\text{feat}}(h(x))$ | Multinomial over hashed n-gram features |
7. Implementation Details and Practical Considerations
Efficient implementation of DSIR involves:
- Setting $m \approx 10{,}000$ hash buckets for n-gram features; larger $m$ yields diminishing returns.
- Using a large but feasible sample from the raw corpus for robust estimation of the raw parameters $\lambda$.
- Excluding trivial documents (e.g., fewer than 40 words, or heavy repetition) prior to feature extraction.
- Employing numerically stable computations: calculate the per-bucket log-ratios $\log \gamma_j - \log \lambda_j$ once, then use sparse inner products, and leverage the log-sum-exp trick for normalization.
- Sampling without replacement via the Gumbel–Top-$k$ algorithm, ensuring high throughput at web scale.
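The sparse inner-product trick is the core of the per-document cost: with the log-ratio vector precomputed, each document's log-weight touches only the buckets it actually contains. A minimal sketch:

```python
from collections import Counter

def log_importance_weight(counts: Counter, log_gamma, log_lambda):
    """Sparse inner product of a document's hashed n-gram counts with the
    precomputed per-bucket log-ratio log(gamma_j) - log(lambda_j).
    Only buckets present in the document contribute to the sum."""
    return sum(c * (log_gamma[j] - log_lambda[j]) for j, c in counts.items())
```

In practice `log_gamma - log_lambda` would be stored as a single precomputed array shared across all documents, so the per-document work is linear in the number of distinct buckets.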
Open-source implementations and curated data outputs are available (Xie et al., 2023). These practices ensure that DSIR remains practical for billion-scale data selection and can be integrated into industrial-scale LM pipelines.
In sum, Data Selection with Importance Resampling formalizes the alignment of pretraining distributions with a downstream target via single-pass, feature-based density modeling and probabilistic resampling. DSIR connects theoretical guarantees from importance sampling to modern LLM training and has demonstrated reproducible improvements in empirical benchmarks over both random and heuristic selection baselines (Xie et al., 2023, Gu et al., 7 Jan 2025, Vogel et al., 2020).