
Rejection Sampling Fine-Tuning (RSFT)

Updated 7 February 2026
  • RSFT is a fine-tuning paradigm that uses rejection sampling to filter self-generated outputs based on an external correctness or reward signal.
  • It systematically generates, scores, and selects candidate outputs, then uses only the correct responses for supervised model fine-tuning.
  • Variants such as AdaSTaR, RIFT, and TrajFusion enhance RSFT by incorporating adaptive sampling and reward-weighted objectives to overcome data inefficiency and overfitting.

Rejection Sampling Fine-Tuning (RSFT), also referred to as RFT or STaR in some literature, is a post-training adaptation paradigm for LLMs and other generative models. RSFT fine-tunes a model on its own self-generated outputs under a binary selection scheme: only outputs meeting an external correctness or reward criterion are retained as supervision, while all others are discarded. This article covers the formal definition, technical workflow, theoretical underpinnings, variants, limitations, and empirical landscape of RSFT in natural language, mathematical reasoning, preference modeling, and beyond.

1. Formal Definition and Core Workflow

In RSFT, a pre-trained (or SFT) model $\pi_\theta$ generates multiple candidate outputs for a given input, and an external correctness criterion or scalar reward function $r(x, y)$ is applied. Trajectories (e.g., solution steps, completions, or responses) are accepted into the training corpus if and only if they meet a fixed reward threshold, typically $r(x, y) = 1$ for correct responses and $r(x, y) = 0$ for incorrect ones (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025, Liu et al., 14 Jan 2026).

The canonical RSFT loop for language and reasoning tasks is:

  1. Candidate Generation: For each input $x$ (problem, prompt, etc.), sample $K$ outputs $Y(x) = \{y_1, \dots, y_K\}$ independently from $\pi_\theta(y \mid x)$.
  2. Scoring: For every candidate $y_k$, evaluate $r(x, y_k)$ (e.g., did it reach the correct boxed answer?).
  3. Selection: Retain only $Y^+(x) = \{y : r(x, y) = 1\}$ (the "positive" set), discarding $Y^-(x) = Y(x) \setminus Y^+(x)$ (the "negative" set).
  4. Supervised Fine-Tuning: Train the next iteration's parameters $\theta'$ by maximizing the sequence- or token-level likelihood (equivalently, minimizing the negative log-likelihood) on $D_\mathrm{RFT} = \{(x, y) : y \in Y^+(x)\}$:

$$\mathcal{L}_\mathrm{SFT}(\theta) = -\sum_{(x,y)\in D_\mathrm{RFT}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x).$$

No modifications to the architecture, reward model, or critic are required; RSFT is standard sequence-to-sequence fine-tuning on a binarily filtered dataset (Deng et al., 4 Feb 2026, Liu et al., 14 Jan 2026, Khaki et al., 2024).
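The four-step loop above can be sketched in a few lines. A minimal sketch in Python, assuming a `sample_fn` that draws one candidate from the current policy and a binary `reward_fn` verifier (both are stand-ins here for a real LLM and answer checker):

```python
import random

def rsft_dataset(problems, sample_fn, reward_fn, k=8):
    """Build a rejection-sampled SFT corpus: generate k candidates per
    input, keep only those the verifier scores as correct."""
    dataset = []
    for x in problems:
        candidates = [sample_fn(x) for _ in range(k)]                # 1. generation
        positives = [y for y in candidates if reward_fn(x, y) == 1]  # 2-3. score/select
        dataset.extend((x, y) for y in positives)                    # 4. SFT pairs
    return dataset

# Toy demo: a noisy "policy" guesses sums; the verifier checks exact match.
random.seed(0)
problems = [(2, 3), (5, 7)]
sample = lambda x: x[0] + x[1] + random.choice([-1, 0, 1])
reward = lambda x, y: 1 if y == x[0] + x[1] else 0
corpus = rsft_dataset(problems, sample, reward, k=8)
```

Every retained pair is correct by construction, which is exactly what makes step 4 plain maximum-likelihood SFT.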

2. Theoretical Characterization

RSFT can be understood as constructing an empirical distribution of "correct" trajectories, approximating the conditional distribution over correct output sequences. Formally, for a model distribution $T(y \mid x)$ and indicator correctness function $\operatorname{corr}(y)$, RSFT's supervision distribution is

$$P_\mathrm{RFT}(y \mid x) \propto T(y \mid x) \cdot \mathbf{1}\{\operatorname{corr}(y) = 1\}$$

as described in (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025).

This mechanism implements a form of rejection sampling: generate candidate samples from a proposal $T(y \mid x)$, accept if $y$ is correct, otherwise reject. The retained samples are then used to fit the model under maximum likelihood, restricting the support to correct-completion regions.
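This support restriction can be checked numerically. A toy check (hypothetical four-output proposal, not taken from the cited papers): rejection-sample from $T(y \mid x)$ and compare the accepted frequencies to the renormalized conditional $P_\mathrm{RFT}$:

```python
import random
from collections import Counter

# Hypothetical proposal T(y|x) over four candidates; "a" and "b" are correct.
proposal = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}
correct = {"a", "b"}

random.seed(1)
outcomes, weights = zip(*proposal.items())
draws = random.choices(outcomes, weights=weights, k=50_000)
accepted = [y for y in draws if y in correct]  # the rejection step

freq = Counter(accepted)
emp = {y: freq[y] / len(accepted) for y in correct}

# P_RFT(y|x) = T(y|x) / Z, with Z the total proposal mass on correct outputs
z = sum(proposal[y] for y in correct)
target = {y: proposal[y] / z for y in correct}  # {"a": 2/3, "b": 1/3}
```

The accepted frequencies converge to the proposal restricted to the correct set and renormalized, matching the supervision distribution defined above.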

In preference optimization or RLHF settings, RSFT can also generate contrastive preference pairs or logit-based supervision by sampling multiple outputs and filtering by reward differences (Khaki et al., 2024, Liu et al., 2023). In this context, RSFT acts as an offline mechanism for constructing high-quality or high-gap supervision pairs from a given base policy.
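For the preference setting, the filtering step might look as follows (a sketch only; `sample_fn` and `reward_fn` are placeholders for a base policy and a scalar reward model, and the margin value is illustrative):

```python
def preference_pairs(prompts, sample_fn, reward_fn, k=4, margin=0.5):
    """Build (prompt, chosen, rejected) triples from k samples per prompt,
    keeping only pairs whose reward gap clears `margin`."""
    pairs = []
    for x in prompts:
        candidates = [sample_fn(x) for _ in range(k)]
        ranked = sorted(candidates, key=lambda y: reward_fn(x, y))
        worst, best = ranked[0], ranked[-1]
        if reward_fn(x, best) - reward_fn(x, worst) >= margin:
            pairs.append((x, best, worst))
    return pairs

# Deterministic toy policy and reward for illustration.
outs = {"p1": ["good", "bad"], "p2": ["meh", "meh"]}
idx = {"p1": 0, "p2": 0}
def sample(x):
    y = outs[x][idx[x] % len(outs[x])]
    idx[x] += 1
    return y
score = {"good": 1.0, "bad": 0.0, "meh": 0.5}
pairs = preference_pairs(["p1", "p2"], sample, lambda x, y: score[y], k=2)
```

Here "p2" yields no pair because its samples have zero reward gap, the degenerate no-contrast case.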

3. Variants and Algorithmic Enhancements

While canonical RSFT employs strict binary rejection, various adaptive or reward-weighted enhancements have been proposed to address its shortcomings.

  • Adaptive STaR (AdaSTaR): Introduces adaptive sampling for both diversity and curriculum. A min-heap tracks both the recency and difficulty (based on "win rate") of each data point, prioritizing under-sampled and harder examples. The update mix is batch-scheduled according to current model competence $\alpha^t$, blending easier data early in training and phasing it out as accuracy rises (Koh et al., 22 May 2025).
  • Hybrid RSFT in Preference Optimization: RSFT is used in combination with Direct Preference Optimization (DPO) to synthesize contrastive preference datasets, where only pairs with high enough reward margin are retained (Khaki et al., 2024).
  • Reward Informed Fine-Tuning (RIFT): Relaxes the hard rejection predicate, reusing all samples under a mixed objective: positive-reward trajectories are reinforced via weighted log-likelihood, while negative ones are suppressed with a stable linear term. This improves statistical efficiency and allows explicit penalization of failure modes (Liu et al., 14 Jan 2026).
  • Trajectory Fusion (TrajFusion): Instead of discarding all incorrect trajectories, TrajFusion fuses them with correct ones, interleaving them to explicitly model trial-and-error reasoning and reflection. When error signals are uninformative, the method reduces to vanilla RSFT (Deng et al., 4 Feb 2026).
  • Combinatorial and Analytic Rejection Samplers: In probabilistic or combinatorial generation, similar accept-reject mechanisms are employed to tune samplers to target distributions or object sizes, where rejections correct for approximation or suboptimal parameters (Bodini et al., 2013).

A summary of key variants:

| Variant | Mechanism | Reference |
| --- | --- | --- |
| Canonical RSFT | Binary accept/reject on correctness | (Deng et al., 4 Feb 2026) |
| AdaSTaR | Adaptive curriculum/diversity sampling | (Koh et al., 22 May 2025) |
| RS-DPO | Contrastive sample filtering by reward margin | (Khaki et al., 2024) |
| RIFT | Reward-weighted objective, negatives included | (Liu et al., 14 Jan 2026) |
| TrajFusion | Fused trial-and-error and reflection trajectories | (Deng et al., 4 Feb 2026) |
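As an illustration of the RIFT-style relaxation (a schematic only; the exact weighting and stabilization in the RIFT paper may differ), a mixed objective over (log-probability, reward) pairs could look like:

```python
import math

def rift_style_loss(samples):
    """Schematic mixed objective over (logp, reward) pairs: positive-reward
    trajectories contribute reward-weighted NLL; zero-reward ones are
    suppressed via a linear term on their probability, which stays bounded
    (unlike -log(1 - p) as p approaches 1)."""
    loss = 0.0
    for logp, r in samples:
        if r > 0:
            loss += -r * logp        # reinforce correct trajectories
        else:
            loss += math.exp(logp)   # linearly penalize p(y|x) of failures
    return loss / max(len(samples), 1)
```

Setting r in {0, 1} and dropping the negative branch recovers plain RSFT on the accepted set.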

4. Empirical Results and Benchmarks

RSFT has become the prevailing baseline for self-improvement in mathematical reasoning and chain-of-thought LLMs. Comprehensive experiments across math (GSM8K, MATH, DeepMath, AIME, CollegeMath, TheoremQA, OlympiadBench), reasoning (ARC-Challenge, CommonsenseQA, CLadder 1.5, ANLI), and summarization/dialogue benchmarks validate its effectiveness (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025, Liu et al., 14 Jan 2026, Khaki et al., 2024, Liu et al., 2023).

Key empirical findings:

  • RSFT baseline performance: Consistently improves over zero-shot or unfiltered self-generated fine-tuning. For instance, on Qwen-2.5-Math-1.5B, RSFT achieved a mean@8 accuracy of 25.6% on MATH (Liu et al., 14 Jan 2026).
  • Limitation in efficiency: RSFT's strict rejection discards the majority of computationally expensive samples. Only a small set of correct outputs are used, reducing data and learning signal efficiency, especially as task complexity increases or model quality drops.
  • Supervised baselines versus RSFT: Methods that reuse negative trajectories (e.g., RIFT) or diversify the sample space (e.g., AdaSTaR, TrajFusion) typically outperform strict RSFT, achieving substantial accuracy gains and sample efficiency—e.g., RIFT improves mean@8 by +11.4 points and pass@8 by +19.1 points over RSFT (Liu et al., 14 Jan 2026), and TrajFusion outperforms RSFT on long-form math problems (Deng et al., 4 Feb 2026).
  • Preference optimization: In RLHF and preference-based alignment, hybrid RSFT pipelines (e.g., RS-DPO, RSO) yield higher reward-model and human win rates than direct or SFT-only methods, with RSO achieving best-in-class gold-reward win rates (Liu et al., 2023).

5. Limitations and Diagnostic Analysis

Major limitations of RSFT include:

  • Signal Destruction: All near-miss or incorrect reasoning trajectories are discarded, erasing information on model failure modes, typical error patterns, and error diversity. This impedes learning to recognize and avoid frequent mistakes (Deng et al., 4 Feb 2026, Liu et al., 14 Jan 2026).
  • Data Inefficiency: The cost of generating and verifying numerous candidates is wasted when most are rejected. For challenging tasks or weaker base models, almost all trajectories may be incorrect, producing few or no positive samples and limiting learning progress (Liu et al., 14 Jan 2026).
  • No Contrastive Signal: In extreme cases (all candidates correct, or all wrong), RSFT provides no supervision or contrast, since only "accepted" samples are ever observed by the model.
  • Overfitting to Success: RSFT may induce the model to overestimate the likelihood of correct answers and under-represent failure regions, resulting in poor calibration around ambiguous or adversarial domains (Deng et al., 4 Feb 2026).

A plausible implication is that richer trial-and-error signals, adaptive curricula, and reward-balanced losses are necessary for robust self-improvement, especially in settings with sparse correctness.

6. Methodological Scope: Applications Beyond Mathematical Reasoning

While initially developed for chain-of-thought mathematical reasoning (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025), RSFT's accept-reject paradigm generalizes:

  • RLHF and Alignment: Filtering model outputs by reward model gaps or preference probabilities, generating synthetic contrastive datasets for DPO (Khaki et al., 2024, Liu et al., 2023).
  • Robotic Manipulation: Online rejection sampling with human-in-the-loop corrections, accepting entire rollouts only if their total reward exceeds a dynamic threshold. This technique demonstrably improves policy robustness and error recovery in real-world manipulation (Lu et al., 30 Oct 2025).
  • Combinatorial and Statistical Sampling: Tuning proposal distributions, correcting for analytic bias in sampled objects, and calibrating generators via envelope parameters and rejection (Bodini et al., 2013).
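The robotics-style accept-reject step can be sketched as follows (a sketch with an assumed batch-quantile threshold; the actual dynamic-threshold rule in the cited work may differ):

```python
def accept_rollouts(rollouts, quantile=0.7):
    """Accept whole rollouts whose total reward clears a dynamic threshold,
    here the given quantile of the current batch's returns. Each rollout
    is a list of (action, reward) steps."""
    returns = [sum(r for _, r in steps) for steps in rollouts]
    threshold = sorted(returns)[int(quantile * (len(returns) - 1))]
    return [steps for steps, g in zip(rollouts, returns) if g >= threshold]
```

Thresholding on whole-rollout return, rather than per-step reward, preserves temporally coherent demonstrations for fine-tuning.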

RSFT shares conceptual ties with other offline RL and self-alignment techniques, including:

  • Supervised Fine-Tuning (SFT): RSFT is a strict filter over SFT- or self-generated data.
  • Direct Preference Optimization (DPO) and SLiC: These methods optimize models via logistic or hinge losses over labeled preference pairs. RSFT provides a mechanism to source higher-quality or on-policy preference pairs (Liu et al., 2023).
  • Reward-Informed Fine-Tuning (RIFT): RIFT generalizes RSFT by stably integrating both positive and negative samples into the loss, improving generalization and efficiency (Liu et al., 14 Jan 2026).
  • Adaptive and Fusion strategies: Methods such as AdaSTaR and TrajFusion extend RSFT to cover adaptive curricula, diversity sampling, and explicit modeling of error and reflection (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025).

Current research trends focus on leveraging the diagnostic structure of negative samples, adaptive sampling strategies, and fusion of correct and incorrect trajectories to further improve learning signal and robustness. Limitations of threshold-based hard rejection are being addressed by reward-informed objectives, curriculum learning, and multi-signal supervision frameworks.
