
Data Selection with Importance Resampling (DSIR)

Updated 13 February 2026
  • DSIR is a principled framework that projects raw text into a feature space and uses density modeling to compute importance weights for curating training subsets.
  • It quantitatively evaluates selection quality with KL-reduction metrics, demonstrating empirical gains in both domain-specific and general language modeling tasks.
  • The approach leverages hashed n-gram extraction and multinomial models to efficiently sample from billions of documents, thereby optimizing pretraining data alignment.

Data Selection with Importance Resampling (DSIR) is a scalable and principled framework for curating large-scale pretraining corpora such that the resulting training data more closely aligns—under a specified feature representation—with a desired target distribution. Originally developed to address the impracticality of direct high-dimensional importance sampling for raw text, DSIR leverages feature-space density modeling and sampling-based subset selection to improve transfer to both domain-specific and general-domain language modeling tasks. The method underpins new metrics for evaluating the quality of data selection, delivers empirical gains on standard benchmarks, and connects the modern data-centric approach for pretraining with fundamental principles in statistical learning theory (Xie et al., 2023, Gu et al., 7 Jan 2025, Vogel et al., 2020).

1. Formalization and Theoretical Foundation

DSIR addresses the problem of selecting a representative subset from a large unlabeled dataset $\mathcal{X} = \{x_1, ..., x_N\} \sim q(x)$ so as to match a smaller "target" set $\mathcal{X}' = \{x'_1, ..., x'_n\} \sim p(x)$, where $p$ and $q$ are unknown distributions over the data space. Direct computation of importance weights $w(x) = p(x)/q(x)$ is infeasible in high-dimensional discrete text spaces. DSIR circumvents this by projecting both raw and target data to a lower-dimensional feature space $\mathcal{Z}$ using a mapping $h: \mathcal{X} \to \mathcal{Z}$, typically via hashed n-gram counts.

The core steps are:

  1. Featurize each document $x$ as $z = h(x)$, yielding count vectors in $\mathbb{N}^m$ (where $m$ is the number of hash buckets).
  2. Fit bag-of-features generative models (multinomials parameterized by $\gamma^p$ and $\gamma^q$) to the target and raw distributions in $\mathcal{Z}$:

$$p_Z(z) = \Pr_{x\sim p}[h(x) = z], \quad q_Z(z) = \Pr_{x\sim q}[h(x) = z]$$

  3. Define the importance weight for sample $x_i$:

$$w_i = \frac{p_Z(h(x_i))}{q_Z(h(x_i))} = \prod_{j=1}^m \biggl(\frac{\hat\gamma_j^{(p)}}{\hat\gamma_j^{(q)}}\biggr)^{z_j(x_i)}$$

The parameters $\hat\gamma_j^{(p)}$ and $\hat\gamma_j^{(q)}$ are estimated from normalized feature counts over the respective datasets (Xie et al., 2023, Gu et al., 7 Jan 2025).

Sampling $k$ points from $\mathcal{X}$ without replacement, with probabilities proportional to $\{w_i\}$, produces a subset whose feature distribution approximates that of the target, in the sense of importance resampling.
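The weight formula above can be sketched directly from these definitions. In the toy sketch below, `fit_multinomial` and the count vectors are illustrative (not the published implementation); add-one smoothing is an assumption introduced here so that no hash bucket has zero probability, and documents are represented as sparse `{bucket: count}` dicts.

```python
import math

def fit_multinomial(count_vectors, m, alpha=1.0):
    """Estimate multinomial parameters gamma_j as normalized feature
    counts over a dataset, with add-alpha smoothing."""
    totals = [alpha] * m
    for z in count_vectors:
        for j, c in z.items():
            totals[j] += c
    s = sum(totals)
    return [t / s for t in totals]

def log_importance_weight(z, gamma_p, gamma_q):
    """log w_i = sum_j z_j * (log gamma_p_j - log gamma_q_j),
    the log of the product formula above."""
    return sum(c * (math.log(gamma_p[j]) - math.log(gamma_q[j]))
               for j, c in z.items())

# Toy example: m = 4 hash buckets, sparse count vectors as {bucket: count}.
target = [{0: 3, 1: 1}, {0: 2, 1: 2}]                 # target-like documents
raw    = [{2: 3, 3: 1}, {0: 1, 2: 2}, {0: 3, 1: 1}]   # raw pool
gp = fit_multinomial(target, m=4)
gq = fit_multinomial(raw, m=4)
log_w = [log_importance_weight(z, gp, gq) for z in raw]
# The raw document whose buckets match the target gets the largest weight.
```

Working in log space avoids underflow when documents contain many n-grams, since the product over $m$ buckets becomes a sum.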

2. Practical Algorithm and Feature Construction

DSIR typically utilizes both unigrams and bigrams hashed into vectors of dimension $m$ (empirically $m = 10\,000$ is effective, with diminishing returns above this scale). Each n-gram $t$ from a document yields a bucket $b = H(t) \bmod m$ via a fast noncryptographic hash function $H$, so the feature vector $z$ counts occurrences across the hash buckets.
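The featurization can be sketched as follows. MD5 stands in here for the fast noncryptographic hash used in practice, and the tokenizer is a plain whitespace split; both are simplifying assumptions.

```python
import hashlib

def hashed_ngram_features(tokens, m=10_000):
    """Map a token sequence to a sparse {bucket: count} vector of
    hashed unigram and bigram counts, using b = H(t) mod m."""
    z = {}
    ngrams = list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    for t in ngrams:
        # MD5 stands in for a fast noncryptographic hash (an assumption).
        b = int(hashlib.md5(t.encode()).hexdigest(), 16) % m
        z[b] = z.get(b, 0) + 1
    return z

z = hashed_ngram_features("importance resampling for data selection".split())
# 5 unigrams + 4 bigrams -> 9 total counts spread over the hash buckets.
```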

The stepwise procedure for DSIR is:

  1. Compute hashed n-gram features for all target and raw documents.
  2. Estimate multinomial parameters $\hat\gamma^{(p)}$ and $\hat\gamma^{(q)}$ for the target and raw datasets.
  3. For every raw document $x_i$, compute the importance weight $w_i$ using the formula above.
  4. Normalize the weights: $\hat w_i = w_i / \sum_{i=1}^N w_i$.
  5. Sample $k$ distinct indices without replacement from the categorical distribution defined by $\{\hat w_i\}$ (commonly implemented via the Gumbel-Top-$k$ trick).
  6. Output the selected subset.

This process is inherently scalable. Empirical results show that subsets of 100M documents can be selected from $N \approx 1.6$B inputs in approximately 4.5 hours on a 96-core CPU node, where feature extraction and weight computation dominate wall-clock runtime (Xie et al., 2023).
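The sampling step can be sketched with the Gumbel-Top-$k$ trick: adding independent Gumbel(0,1) noise to each log-weight and keeping the $k$ largest perturbed values samples $k$ distinct indices with probability proportional to the weights, and no explicit normalization is required. The function name and toy weights below are illustrative.

```python
import math
import random

def gumbel_top_k(log_weights, k, seed=0):
    """Sample k distinct indices without replacement, with probability
    proportional to exp(log_weights), via the Gumbel-Top-k trick."""
    rng = random.Random(seed)
    keys = []
    for i, lw in enumerate(log_weights):
        g = -math.log(-math.log(rng.random()))  # Gumbel(0,1) sample
        keys.append((lw + g, i))
    # Keep the indices of the k largest perturbed log-weights.
    return [i for _, i in sorted(keys, reverse=True)[:k]]

# Toy pool of 6 documents; the last two have much larger weights and
# are therefore selected with very high probability.
log_w = [-5.0, -5.0, -5.0, -5.0, 2.0, 2.0]
selected = gumbel_top_k(log_w, k=2)
```

Because only the top-$k$ perturbed keys are needed, this runs in a single pass over the pool and parallelizes trivially across shards.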

3. Evaluation Metrics: The KL-Reduction Criterion

To quantitatively measure how well selected pretraining data matches the target distribution in the feature space, DSIR introduces the KL-reduction metric. For empirical distributions $p$, $q$, $p'$ over $\mathcal{Z}$ (the target, raw, and selected-subset feature distributions), the KL-reduction over a set $\mathcal{P}$ of targets is:

$$\mathrm{KL\text{-}reduction}(p'; q, \mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{p\in\mathcal{P}} \left[ \mathrm{KL}(p\|q) - \mathrm{KL}(p\|p') \right]$$

where $\mathrm{KL}(p\|r) = \sum_j p[j]\log\frac{p[j]}{r[j]}$.

High KL-reduction indicates that the selected subset achieves a significant reduction in feature-space divergence relative to the raw pool, predicting stronger alignment with the target and, empirically, better downstream performance. In extensive experiments, the KL-reduction on hashed n-gram features correlates strongly with downstream F1 across multiple methods (Pearson $r = 0.82$) (Xie et al., 2023).
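The metric can be computed directly from empirical bucket distributions. The three-bucket distributions below are made-up toy values, chosen so the selected subset sits closer to the target than the raw pool does.

```python
import math

def kl(p, r):
    """KL(p || r) = sum_j p[j] * log(p[j]/r[j]), over buckets with p[j] > 0."""
    return sum(pj * math.log(pj / rj) for pj, rj in zip(p, r) if pj > 0)

def kl_reduction(p_sel, q_raw, targets):
    """Average of KL(p||q) - KL(p||p') over a set of target distributions."""
    return sum(kl(p, q_raw) - kl(p, p_sel) for p in targets) / len(targets)

# Toy feature distributions over 3 hash buckets.
target   = [0.6, 0.3, 0.1]
raw      = [0.2, 0.3, 0.5]   # raw pool is far from the target
selected = [0.5, 0.3, 0.2]   # selected subset is much closer
score = kl_reduction(selected, raw, [target])
# Positive score: selection reduced feature-space divergence to the target.
```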

4. Empirical Results and Impact

Extensive validation demonstrates DSIR's impact on both domain-adaptive pretraining and general-domain language modeling. Key findings include (Xie et al., 2023, Gu et al., 7 Jan 2025):

  • Domain-Specific Pretraining: On eight domain-shifted tasks with 25M selected examples, DSIR yields a +1.2 percentage point F1 increase over random selection, and outperforms heuristic classifier filtering (+0.9 pp) and often expert-curated selection (+0.3 pp). Within-domain transfer is optimal; a mismatched target reduces F1 by roughly 6 pp.
  • General-Domain Pretraining: When the target is Wikipedia and Books (GLUE benchmark), DSIR surpasses random selection and GPT-3 heuristic filtering by 2–2.5% in average GLUE scores.
  • Comparative Baselines: Random selection and heuristic filtering underperform relative to DSIR-selected subsets; deterministic "top-$k$" variants perform similarly but are consistently less effective than full resampling.
  • Feature-Space Metrics: DSIR-selected data achieves the highest KL-reduction and the best downstream F1 under hashed n-gram evaluation among the compared methods.

These results confirm the value of importance-resampled data curation for large-scale pretraining. DSIR's modular feature space and resampling protocol unlock state-of-the-art efficiency and performance for both general and domain-specific LMs.

5. Extensions: Feature Design, Trade-Offs, and Hybrid Methods

The selection of the feature space is central to DSIR's performance. The standard approach employs hashed n-gram statistics due to their tractability and strong alignment with token-level objectives (e.g., masked language modeling). Alternative approaches utilizing sentence-level neural embeddings (e.g., 384-dim Sentence-Transformer GMMs) are possible and can be integrated in hybrid importance-resampling schemes.

Empirical observations on feature choices (Gu et al., 7 Jan 2025):

  • N-gram based DSIR is preferred for token-prediction pretraining objectives, offering strong gains on nearly all GLUE tasks.
  • Embedding-based variants (as in Hybrid Importance Resampling, or "HIR") yield gains in tasks requiring global sentence-level semantics (notably STS-B), but entail higher computational cost and can suffer from imprecise density estimates in high-dimensional embedding spaces.
  • A hybrid importance weight of the form $\omega_i^{(\mathrm{hyb})} = (\omega_i^{(\mathrm{ng})})^\alpha (\omega_i^{(\mathrm{nn})})^{1-\alpha}$ interpolates between n-gram and neural feature spaces. Exhaustive tuning of the interpolation parameter $\alpha$ is an open area for optimization.
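In log space, the hybrid weight above is just a convex combination of the two log-weights. A minimal sketch (function name and numbers illustrative):

```python
def hybrid_log_weight(log_w_ngram, log_w_neural, alpha):
    """log w_hyb = alpha * log w_ng + (1 - alpha) * log w_nn,
    the log form of w_hyb = w_ng^alpha * w_nn^(1 - alpha)."""
    return alpha * log_w_ngram + (1.0 - alpha) * log_w_neural

# alpha = 1 recovers pure n-gram weights; alpha = 0 pure embedding weights.
w_half = hybrid_log_weight(0.8, -0.4, alpha=0.5)
```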

6. Broader Connections to Importance Sampling and Statistical Learning

DSIR is conceptually linked to importance sampling and the weighted empirical risk minimization (ERM) used in statistical learning. Weighted ERM corrects for distribution shift by reweighting loss contributions via the likelihood ratio $\Phi(z) = \frac{dP}{dP'}(z)$. DSIR operationalizes this paradigm for raw text by using efficient density estimation in a reduced feature space.

In standard settings (label shift, stratum shift, covariate shift), closed-form weights are known, and rigorous bounds guarantee that using either true or well-estimated weights preserves learning rates, modulo mild dependence on estimation error (Vogel et al., 2020). When ground truth distributional ratios are inaccessible (as with open-domain text), feature-space modeling provides a pragmatic approximation, and single-pass resampling suffices for unbiased risk estimation under the proxy distribution.

| Bias Scenario   | Weight Formula              | Notes                                        |
|-----------------|-----------------------------|----------------------------------------------|
| Label shift     | $p/p',\ (1-p)/(1-p')$       | Class priors known/estimated                 |
| Stratum shift   | $p_s/p'_s$                  | Requires stratum (domain/category) metadata  |
| Covariate shift | $q(x)/r(x)$                 | Density-ratio estimation necessary           |
| DSIR (LM)       | $p_Z(z)/q_Z(z)$             | Multinomial over hashed n-gram features      |
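As a toy illustration of the label-shift case, a self-normalized importance-weighted risk estimate recovers the target risk from samples drawn under the shifted distribution. All numbers below are made up for illustration.

```python
def weighted_risk(losses, weights):
    """Self-normalized importance-weighted empirical risk: estimates
    E_P[loss] from samples drawn under P', with ratios w = dP/dP'."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

# Label-shift toy: per-class losses are 0.2 (class 0) and 1.0 (class 1).
# The sample has priors p' = (0.8, 0.2) but the target has p = (0.5, 0.5),
# so the per-sample weights are p/p': 0.5/0.8 for class 0, 0.5/0.2 for class 1.
losses  = [0.2] * 8 + [1.0] * 2
weights = [0.5 / 0.8] * 8 + [0.5 / 0.2] * 2
r = weighted_risk(losses, weights)
# r recovers the target risk 0.5 * 0.2 + 0.5 * 1.0 = 0.6.
```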

7. Implementation Details and Practical Considerations

Efficient implementation of DSIR involves:

  • Setting $m = 10\,000$ hash buckets for n-gram features; larger $m$ yields diminishing returns.
  • Using a large but feasible sample from the raw corpus ($s \sim 1$B hashed buckets) for robust estimation of $\hat\gamma^{(q)}$.
  • Excluding trivial documents (e.g., fewer than 40 words, or heavy repetition) prior to feature extraction.
  • Employing numerically stable computations: calculate log-ratios once, then use sparse inner products, and leverage the log-sum-exp trick for normalization.
  • Sampling without replacement via the Gumbel-Top-$k$ algorithm, ensuring high throughput at web scale.
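The numerical-stability bullet can be sketched as follows: log-ratios are precomputed once per bucket, each document's log-weight becomes a sparse inner product, and normalization uses log-sum-exp. The toy parameters are illustrative, not the released implementation.

```python
import math

def precompute_log_ratios(gamma_p, gamma_q):
    """Compute log(gamma_p_j / gamma_q_j) once per bucket."""
    return [math.log(p) - math.log(q) for p, q in zip(gamma_p, gamma_q)]

def log_weight(sparse_counts, log_ratios):
    """log w_i as a sparse inner product of counts with log-ratios."""
    return sum(c * log_ratios[j] for j, c in sparse_counts.items())

def normalize(log_weights):
    """Normalized weights via the log-sum-exp trick: subtract the max
    before exponentiating to avoid overflow/underflow."""
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    s = sum(exps)
    return [e / s for e in exps]

log_ratios = precompute_log_ratios([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
lws = [log_weight(z, log_ratios) for z in [{0: 2}, {2: 2}, {1: 1}]]
probs = normalize(lws)
# probs sums to 1; the document concentrated on bucket 0 dominates.
```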

Open-source implementations and curated data outputs are available (Xie et al., 2023). These practices ensure that DSIR remains practical for billion-scale data selection and can be integrated into industrial-scale LM pipelines.


In sum, Data Selection with Importance Resampling formalizes the alignment of pretraining distributions with a downstream target via single-pass, feature-based density modeling and probabilistic resampling. DSIR connects theoretical guarantees from importance sampling to modern LLM training and has demonstrated reproducible improvements in empirical benchmarks over both random and heuristic selection baselines (Xie et al., 2023, Gu et al., 7 Jan 2025, Vogel et al., 2020).
