Difficulty-Aware Rejection Sampling

Updated 16 January 2026

The paper introduces difficulty-aware rejection sampling, an adaptive technique that adjusts proposal strategies based on the challenge of the target function to reduce rejection rates.
It is applied in various domains, including adaptive Monte Carlo methods, synthetic data generation for imbalanced learning, and MCMC latent variable inference with established performance bounds.
The approach leverages dynamic envelope refinement and empirical evaluations to achieve near-minimax optimal rejection rates while balancing computational and statistical efficiency.

Difficulty-aware rejection sampling refers to a class of adaptive algorithms designed to improve the statistical and computational efficiency of rejection sampling procedures by tuning allocation or proposal mechanisms according to the specific difficulty of the target function, query, or conditional. The paradigm is instantiated in distinct domains: adaptive Monte Carlo methods for continuous densities (Achdou et al., 2018), @@@@1@@@@ for imbalanced or hard-label distribution learning (Tong et al., 2024), and exact latent variable inference in MCMC Gibbs samplers (Raim et al., 21 Sep 2025). Procedures are engineered to address either locally challenging regions of the target distribution or queries with empirically low acceptance probabilities, and they often provide performance guarantees with respect to minimax lower bounds, rejection rates, and practical computational trade-offs.

1. Principles of Adaptive Rejection Sampling

Classic rejection sampling seeks to sample from a target density $f$ by drawing candidates $x$ from a tractable proposal $g$ and accepting those with probability $f(x)/(M g(x))$ for a suitable envelope constant $M \geq \sup_x f(x)/g(x)$ . Standard methods suffer from high rejection rates when $f$ is complex or multimodal, especially in high dimensions or in the absence of convenient global proposals.

Difficulty-aware (or adaptive) rejection sampling modifies the mechanism to dynamically estimate suitable envelope or proposal functions by leveraging information from previous sample attempts. In the NNARS framework (Achdou et al., 2018), the procedure constructs piecewise-constant approximations to $f$ on a grid and maintains confidence radii. This adaptivity allows the envelope to "focus" more efficiently on difficult regions, reducing overall rejection.

In other domains, such as synthetic data generation for instruction-fine-tuning LLMs, difficulty-awareness is defined by allocating more generative trials to queries for which acceptance probabilities are empirically low, directly countering dataset bias and enhancing learning for rare or challenging cases (Tong et al., 2024).

2. Minimax Lower Bounds and Near-Optimal Adaptive Algorithms

Achdou et al. (Achdou et al., 2018) formalize the performance limits of difficulty-aware rejection sampling over a class of $s$ -Hölder smooth densities on $[0,1]^d$ bounded from below. Any adaptive procedure $A$ that performs $n$ total function evaluations, and produces $\hat n$ accepted samples, incurs a loss $L_n = n - \hat n$ . The minimax lower bound asserts that, for all sufficiently large $n$ ,

$\inf_A \sup_{f \in \mathcal{F}_0(s,H,c_f,d)} \mathbb{E}_f[L_n(A)] \geq C(s,d)\, n^{1-s/d}$

for $C(s,d) > 0$ , and explicit constants are given. This quantifies the irreducible cost of sampling under limited regularity.

The NNARS algorithm achieves a rejection rate upper bound up to a logarithmic factor:

$\mathbb{E}_f[L_n(\mathrm{NNARS})] \leq C\, \log^2(n) \, n^{1-s/d}$

with $C$ computable from model parameters. NNARS does not require log-concavity, only Hölder regularity and positivity, and iteratively refines its proposal envelopes via nearest-neighbor estimation. This near-minimax-optimality distinguishes NNARS from prior adaptive methods such as PRS, ARS, or A*-type samplers, which impose stricter regularity or structure requirements and often lack explicit finite-sample guarantees.

3. Difficulty-Aware Sampling in Synthetic Data Generation

In the context of training LLMs to solve mathematical problems, DART-Math (Tong et al., 2024) introduces difficulty-aware rejection tuning by quantifying per-query failure rates $d(q)$ using an external synthesis model. For a set of training queries $Q$ , each query $q$ is scored according to the proportion of incorrect answers among $n_d$ chain-of-thought samples.

Accepted synthetic samples $(q,y)$ are only retained if the final generated answer matches ground truth. The difficulty-aware allocation is then realized by either (i) uniform targets $T_\mathrm{uniform}(q) = k_u$ for all $q$ , or (ii) proportional-to-difficulty targets $T_\mathrm{prop2diff}(q) = \max(1, \lround k_p\, d(q) / d_\max \rround)$, subject to a cap $n_\mathrm{max}$ on total trials per query.

The rationale is that dedicating more attempts to hard queries (high $d(q)$ ) yields more informative data for subsequent model training, mitigating "easy-query bias," accelerating learning of complex reasoning pathways, and producing competitive models and datasets with lower sample complexity than prior vanilla rejection samplers. Empirical evaluations across six benchmarks reveal multi-point gains over vanilla rejection tuning and several public baselines, with robust improvements in mathematical reasoning tasks.

4. Self-Tuned Rejection Sampling in MCMC and Latent Variable Models

When performing Gibbs sampling for Bayesian hierarchical models, certain conditionals may correspond to complex densities without tractable direct samplers. The self-tuned vertical weighted strips (VWS) method (Raim et al., 21 Sep 2025) treats the target density $f(x) = w(x) g(x) / \psi$ with $g$ a base density and $w$ a positive weight, partitioning the support into strips $(\alpha_{j-1}, \alpha_j]$ .

Each strip is bounded by $\overline w_j$ and $\underline w_j$ , constructing a proposal mixture $h(x)$ of truncated- $g$ components. The maximal rejection rate is bounded by $\rho_+ = 1 - (\sum_j \underline \xi_j)/(\sum_j \overline \xi_j)$ and can be locally decomposed into strip-wise contributions $\rho_j$ .

Self-tuning proceeds by incrementally refining strips with high $\rho_j$ or coarsening strips with negligible $\rho_j$ , only updating the proposals when required. The method persists proposal distributions across Gibbs iterations, limiting both computational overhead and high rejection incidences. In a case study on small area estimation, self-tuned VWS yields exact conditional draws, high effective sample sizes ( $\mathrm{ESS} > 1000$ ), and computational workloads substantially reduced compared to rebuilding proposals or non-adaptive IMH samplers.

5. Computational and Practical Aspects

Computational cost for difficulty-aware samplers is a product of envelope maintenance, proposal mixture updates, and the underlying acceptance mechanism. For NNARS (Achdou et al., 2018), overall compute is $O(n \log n)$ with $O(n)$ memory, and each proposal is updated in $O(|\chi|\,d)$ per round; envelope sampling uses alias or binary search in $O(\log |\chi|)$ .

Self-tuned VWS (Raim et al., 21 Sep 2025) achieves $O(\log N)$ update steps for knot adjustments, and candidate draws require singular truncated- $g$ samples. Practical settings for user tolerances $(\epsilon_1, \epsilon_2)$ balance proposal complexity against acceptance rates; moderate values empirically deliver high efficiency.

DART-Math (Tong et al., 2024) synthesizes $\sim$ 150M raw samples at $\sim$ 35K/hr on a single A100 GPU, with final accepted datasets of 590K samples per variant after difficulty-aware filtering. The synthesis procedure, including chain-of-thought prompt generation and automatic answer checking, is amortized—downstream users need only the compact difficulty-enhanced dataset.

Difficulty-aware rejection samplers improve upon a range of prior adaptive and nonadaptive methods:

Method	Assumptions	Finite-Sample Guarantee
NNARS	Hölder regularity, $f>0$	Minimax bound, near-optimal (Achdou et al., 2018)
PRS	Kernel estimator, $s$ -Hölder	No minimax bound; slower rate
ARS	Log-concavity of $f$	Asymptotic only
A/OS	Tractable decomposition	No minimax guarantee
DART-Math	Query difficulty (empirical)	Outperforms VRT, efficient data size (Tong et al., 2024)
Self-tuned VWS	Piecewise analysis, no log-concavity	Controllable rejection, robust mixing (Raim et al., 21 Sep 2025)

Difficulty awareness is realized by (a) modulating the proposal or envelope structure according to empirical or analytic difficulty, (b) allocating sampling trials in proportion to local acceptance rates or global challenge, and (c) providing either minimax or practical mixing guarantees beyond classic generic approaches. This paradigm is now standard in both continuous distribution Monte Carlo and synthetic data pipeline domains.

7. Empirical Performance and Case Study Results

Difficulty-aware methods yield substantial gains over naive or standard techniques:

NNARS (Achdou et al., 2018) matches the theoretical rejection bound up to logarithmic factors and outperforms PRS and ARS under mild regularity conditions.
DART-Math (Tong et al., 2024) improves instruction-tuned LLM accuracy by $+4.5$ average points over vanilla sampling across six mathematical benchmarks (with public models), matches or exceeds state-of-the-art systems with significantly smaller datasets, and is not dependent on proprietary models (such as GPT-4). On the MATH dataset, Llama3-8B + Prop2Diff achieves $46.6\%$ compared to $39.7\%$ for vanilla rejection tuning, and $32.5\%$ for MetaMath.
Self-tuned VWS (Raim et al., 21 Sep 2025) increases minimum and percentile effective sample sizes (ESS) for latent variance components ( $>1100$ vs $<70$ for IMH), reduces rejections by $2\times$ , and completes cycles in $2$ minutes vs $39$ minutes or $30$ seconds for alternate strategies. Knot updates stabilize to $\sim$ 0--4 per iteration after warm-up.

A plausible implication is that difficulty-aware allocation strategies, when deployed in either density-based Monte Carlo or query selection in synthetic learning pipelines, systematically improve both asymptotic and practical performance, favoring their adoption in modern inference and data-centric workflows.