
Rejection Sampling Fine-Tuning (RSFT)

Updated 7 February 2026
  • RSFT is a fine-tuning paradigm that uses rejection sampling to filter self-generated outputs based on an external correctness or reward signal.
  • It systematically generates, scores, and selects candidate outputs, then uses only the correct responses for supervised model fine-tuning.
  • Variants such as AdaSTaR, RIFT, and TrajFusion enhance RSFT by incorporating adaptive sampling and reward-weighted objectives to overcome data inefficiency and overfitting.

Rejection Sampling Fine-Tuning (RSFT), also referred to as RFT or STaR in some literature, is a post-training adaptation paradigm for LLMs and other generative models. RSFT fine-tunes a model on its own self-generated outputs under a binary selection scheme: only outputs meeting an external correctness or reward criterion are retained as supervision, while all others are discarded. This article covers the formal definition, technical workflow, theoretical underpinnings, variants, limitations, and empirical landscape of RSFT in natural language, mathematical reasoning, preference modeling, and beyond.

1. Formal Definition and Core Workflow

In RSFT, a pre-trained (or SFT) model $\pi_\theta$ generates multiple candidate outputs for a given input, and an external correctness criterion or scalar reward function $r(x, y)$ is applied. Trajectories (e.g., solution steps, completions, or responses) are accepted into the training corpus if and only if they meet a fixed reward threshold, typically $r(x, y) = 1$ for correct responses and $r(x, y) = 0$ for incorrect ones (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025, Liu et al., 14 Jan 2026).

The canonical RSFT loop for language and reasoning tasks is:

  1. Candidate Generation: For each input $x$ (problem, prompt, etc.), sample $K$ outputs $Y(x) = \{y_1, \dots, y_K\}$ independently from $\pi_\theta(y \mid x)$.
  2. Scoring: For every candidate $y_k$, evaluate $r(x, y_k)$ (e.g., did it reach the correct boxed answer?).
  3. Selection: Retain only $Y^+(x) = \{y : r(x, y) = 1\}$ (the "positive" set), discarding $Y^-(x) = Y(x) \setminus Y^+(x)$ (the "negative" set).
  4. Supervised Fine-Tuning: Train the next iteration's parameters $\theta'$ by maximizing the sequence- or token-level likelihood (equivalently, minimizing the negative log-likelihood) on $D_\mathrm{RFT} = \{(x, y) : y \in Y^+(x)\}$:

$$\mathcal{L}_\mathrm{SFT}(\theta) = -\sum_{(x,y)\in D_\mathrm{RFT}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x).$$

No modifications to the architecture, reward model, or critic are required; RSFT is standard sequence-to-sequence fine-tuning on a binarily filtered dataset (Deng et al., 4 Feb 2026, Liu et al., 14 Jan 2026, Khaki et al., 2024).
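The four-step loop above can be sketched in a few lines. A minimal sketch in Python, assuming a `sample_fn` that draws one candidate from the current policy and a binary `reward_fn` verifier (both are stand-ins here for a real LLM and answer checker):

```python
import random

def rsft_dataset(problems, sample_fn, reward_fn, k=8):
    """Build a rejection-sampled SFT corpus: generate k candidates per
    input, keep only those the verifier scores as correct."""
    dataset = []
    for x in problems:
        candidates = [sample_fn(x) for _ in range(k)]                # 1. generation
        positives = [y for y in candidates if reward_fn(x, y) == 1]  # 2-3. score/select
        dataset.extend((x, y) for y in positives)                    # 4. SFT pairs
    return dataset

# Toy demo: a noisy "policy" guesses sums; the verifier checks exact match.
random.seed(0)
problems = [(2, 3), (5, 7)]
sample = lambda x: x[0] + x[1] + random.choice([-1, 0, 1])
reward = lambda x, y: 1 if y == x[0] + x[1] else 0
corpus = rsft_dataset(problems, sample, reward, k=8)
```

Every retained pair is correct by construction, which is exactly what makes step 4 plain maximum-likelihood SFT.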

2. Theoretical Characterization

RSFT can be understood as constructing an empirical distribution of "correct" trajectories, approximating the conditional distribution over correct output sequences. Formally, for a model distribution $T(y \mid x)$ and indicator correctness function $\operatorname{corr}(y)$, RSFT's supervision distribution is

$$P_\mathrm{RFT}(y \mid x) \propto T(y \mid x) \cdot \mathbf{1}\{\operatorname{corr}(y) = 1\}$$

as described in (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025).

This mechanism implements a form of rejection sampling: generate candidate samples from a proposal $T(y \mid x)$, accept if $y$ is correct, otherwise reject. The retained samples are then used to fit the model under maximum likelihood, restricting the support to correct-completion regions.
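This support restriction can be checked numerically. A toy check (hypothetical four-output proposal, not taken from the cited papers): rejection-sample from $T(y \mid x)$ and compare the accepted frequencies to the renormalized conditional $P_\mathrm{RFT}$:

```python
import random
from collections import Counter

# Hypothetical proposal T(y|x) over four candidates; "a" and "b" are correct.
proposal = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}
correct = {"a", "b"}

random.seed(1)
outcomes, weights = zip(*proposal.items())
draws = random.choices(outcomes, weights=weights, k=50_000)
accepted = [y for y in draws if y in correct]  # the rejection step

freq = Counter(accepted)
emp = {y: freq[y] / len(accepted) for y in correct}

# P_RFT(y|x) = T(y|x) / Z, with Z the total proposal mass on correct outputs
z = sum(proposal[y] for y in correct)
target = {y: proposal[y] / z for y in correct}  # {"a": 2/3, "b": 1/3}
```

The accepted frequencies converge to the proposal restricted to the correct set and renormalized, matching the supervision distribution defined above.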

In preference optimization or RLHF settings, RSFT can also generate contrastive preference pairs or logit-based supervision by sampling multiple outputs and filtering by reward differences (Khaki et al., 2024, Liu et al., 2023). In this context, RSFT acts as an offline mechanism for constructing high-quality or high-gap supervision pairs from a given base policy.
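For the preference setting, the filtering step might look as follows (a sketch only; `sample_fn` and `reward_fn` are placeholders for a base policy and a scalar reward model, and the margin value is illustrative):

```python
def preference_pairs(prompts, sample_fn, reward_fn, k=4, margin=0.5):
    """Build (prompt, chosen, rejected) triples from k samples per prompt,
    keeping only pairs whose reward gap clears `margin`."""
    pairs = []
    for x in prompts:
        candidates = [sample_fn(x) for _ in range(k)]
        ranked = sorted(candidates, key=lambda y: reward_fn(x, y))
        worst, best = ranked[0], ranked[-1]
        if reward_fn(x, best) - reward_fn(x, worst) >= margin:
            pairs.append((x, best, worst))
    return pairs

# Deterministic toy policy and reward for illustration.
outs = {"p1": ["good", "bad"], "p2": ["meh", "meh"]}
idx = {"p1": 0, "p2": 0}
def sample(x):
    y = outs[x][idx[x] % len(outs[x])]
    idx[x] += 1
    return y
score = {"good": 1.0, "bad": 0.0, "meh": 0.5}
pairs = preference_pairs(["p1", "p2"], sample, lambda x, y: score[y], k=2)
```

Here "p2" yields no pair because its samples have zero reward gap, the degenerate no-contrast case.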

3. Variants and Algorithmic Enhancements

While canonical RSFT employs strict binary rejection, various adaptive or reward-weighted enhancements have been proposed to address its shortcomings.

  • Adaptive STaR (AdaSTaR): Introduces adaptive sampling for both diversity and curriculum. A min-heap tracks both the recency and difficulty (based on "win rate") of each data point, prioritizing under-sampled and harder examples. The update mix is batch-scheduled according to current model competence $\alpha^t$, blending easier data early in training and phasing it out as accuracy rises (Koh et al., 22 May 2025).
  • Hybrid RSFT in Preference Optimization: RSFT is used in combination with Direct Preference Optimization (DPO) to synthesize contrastive preference datasets, where only pairs with high enough reward margin are retained (Khaki et al., 2024).
  • Reward Informed Fine-Tuning (RIFT): Relaxes the hard rejection predicate, reusing all samples under a mixed objective: positive-reward trajectories are reinforced via weighted log-likelihood, while negative ones are suppressed with a stable linear term. This improves statistical efficiency and allows explicit penalization of failure modes (Liu et al., 14 Jan 2026).
  • Trajectory Fusion (TrajFusion): Instead of discarding all incorrect trajectories, TrajFusion fuses them with correct ones, interleaving them to explicitly model trial-and-error reasoning and reflection. When error signals are uninformative, the method reduces to vanilla RSFT (Deng et al., 4 Feb 2026).
  • Combinatorial and Analytic Rejection Samplers: In probabilistic or combinatorial generation, similar accept-reject mechanisms are employed to tune samplers to target distributions or object sizes, where rejections correct for approximation or suboptimal parameters (Bodini et al., 2013).

A summary of key variants:

| Variant | Mechanism | Reference |
| --- | --- | --- |
| Canonical RSFT | Binary accept/reject on correctness | (Deng et al., 4 Feb 2026) |
| AdaSTaR | Adaptive curriculum/diversity sampling | (Koh et al., 22 May 2025) |
| RS-DPO | Contrastive sample filtering by reward margin | (Khaki et al., 2024) |
| RIFT | Reward-weighted objective, negatives included | (Liu et al., 14 Jan 2026) |
| TrajFusion | Fused trial-and-error and reflection trajectories | (Deng et al., 4 Feb 2026) |
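As an illustration of the RIFT-style relaxation (a schematic only; the exact weighting and stabilization in the RIFT paper may differ), a mixed objective over (log-probability, reward) pairs could look like:

```python
import math

def rift_style_loss(samples):
    """Schematic mixed objective over (logp, reward) pairs: positive-reward
    trajectories contribute reward-weighted NLL; zero-reward ones are
    suppressed via a linear term on their probability, which stays bounded
    (unlike -log(1 - p) as p approaches 1)."""
    loss = 0.0
    for logp, r in samples:
        if r > 0:
            loss += -r * logp        # reinforce correct trajectories
        else:
            loss += math.exp(logp)   # linearly penalize p(y|x) of failures
    return loss / max(len(samples), 1)
```

Setting r in {0, 1} and dropping the negative branch recovers plain RSFT on the accepted set.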

4. Empirical Results and Benchmarks

RSFT has become the prevailing baseline for self-improvement in mathematical reasoning and chain-of-thought LLMs. Comprehensive experiments across math (GSM8K, MATH, DeepMath, AIME, CollegeMath, TheoremQA, OlympiadBench), reasoning (ARC-Challenge, CommonsenseQA, CLadder 1.5, ANLI), and summarization/dialogue benchmarks validate its effectiveness (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025, Liu et al., 14 Jan 2026, Khaki et al., 2024, Liu et al., 2023).

Key empirical findings:

  • RSFT baseline performance: Consistently improves over zero-shot or unfiltered self-generated fine-tuning. For instance, on Qwen-2.5-Math-1.5B, RSFT achieved a mean@8 accuracy of 25.6% on MATH (Liu et al., 14 Jan 2026).
  • Limitation in efficiency: RSFT's strict rejection discards the majority of computationally expensive samples. Only a small set of correct outputs are used, reducing data and learning signal efficiency, especially as task complexity increases or model quality drops.
  • Supervised baselines versus RSFT: Methods that reuse negative trajectories (e.g., RIFT) or diversify the sample space (e.g., AdaSTaR, TrajFusion) typically outperform strict RSFT, achieving substantial accuracy gains and sample efficiency—e.g., RIFT improves mean@8 by +11.4 points and pass@8 by +19.1 points over RSFT (Liu et al., 14 Jan 2026), and TrajFusion outperforms RSFT on long-form math problems (Deng et al., 4 Feb 2026).
  • Preference optimization: In RLHF and preference-based alignment, hybrid RSFT pipelines (e.g., RS-DPO, RSO) yield higher reward-model and human win rates than direct or SFT-only methods, with RSO achieving best-in-class gold-reward win rates (Liu et al., 2023).

5. Limitations and Diagnostic Analysis

Major limitations of RSFT include:

  • Signal Destruction: All near-miss or incorrect reasoning trajectories are discarded, erasing information on model failure modes, typical error patterns, and error diversity. This impedes learning to recognize and avoid frequent mistakes (Deng et al., 4 Feb 2026, Liu et al., 14 Jan 2026).
  • Data Inefficiency: The cost of generating and verifying numerous candidates is wasted when most are rejected. For challenging tasks or weaker base models, almost all trajectories may be incorrect, producing few or no positive samples and limiting learning progress (Liu et al., 14 Jan 2026).
  • No Contrastive Signal: In extreme cases (all candidates correct, or all wrong), RSFT provides no supervision or contrast, since only "accepted" samples are ever observed by the model.
  • Overfitting to Success: RSFT may induce the model to overestimate the likelihood of correct answers and under-represent failure regions, resulting in poor calibration around ambiguous or adversarial domains (Deng et al., 4 Feb 2026).

A plausible implication is that richer trial-and-error signals, adaptive curricula, and reward-balanced losses are necessary for robust self-improvement, especially in settings with sparse correctness.

6. Methodological Scope: Applications Beyond Mathematical Reasoning

While initially developed for chain-of-thought mathematical reasoning (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025), RSFT's accept-reject paradigm generalizes:

  • RLHF and Alignment: Filtering model outputs by reward model gaps or preference probabilities, generating synthetic contrastive datasets for DPO (Khaki et al., 2024, Liu et al., 2023).
  • Robotic Manipulation: Online rejection sampling with human-in-the-loop corrections, accepting entire rollouts only if their total reward exceeds a dynamic threshold. This technique demonstrably improves policy robustness and error recovery in real-world manipulation (Lu et al., 30 Oct 2025).
  • Combinatorial and Statistical Sampling: Tuning proposal distributions, correcting for analytic bias in sampled objects, and calibrating generators via envelope parameters and rejection (Bodini et al., 2013).
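The robotics-style accept-reject step can be sketched as follows (a sketch with an assumed batch-quantile threshold; the actual dynamic-threshold rule in the cited work may differ):

```python
def accept_rollouts(rollouts, quantile=0.7):
    """Accept whole rollouts whose total reward clears a dynamic threshold,
    here the given quantile of the current batch's returns. Each rollout
    is a list of (action, reward) steps."""
    returns = [sum(r for _, r in steps) for steps in rollouts]
    threshold = sorted(returns)[int(quantile * (len(returns) - 1))]
    return [steps for steps, g in zip(rollouts, returns) if g >= threshold]
```

Thresholding on whole-rollout return, rather than per-step reward, preserves temporally coherent demonstrations for fine-tuning.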

RSFT shares conceptual ties with other offline RL and self-alignment techniques, including:

  • Supervised Fine-Tuning (SFT): RSFT is a strict filter over SFT- or self-generated data.
  • Direct Preference Optimization (DPO) and SLiC: These methods optimize models via logistic or hinge losses over labeled preference pairs. RSFT provides a mechanism to source higher-quality or on-policy preference pairs (Liu et al., 2023).
  • Reward-Informed Fine-Tuning (RIFT): RIFT generalizes RSFT by stably integrating both positive and negative samples into the loss, improving generalization and efficiency (Liu et al., 14 Jan 2026).
  • Adaptive and Fusion strategies: Methods such as AdaSTaR and TrajFusion extend RSFT to cover adaptive curricula, diversity sampling, and explicit modeling of error and reflection (Deng et al., 4 Feb 2026, Koh et al., 22 May 2025).

Current research trends focus on leveraging the diagnostic structure of negative samples, adaptive sampling strategies, and fusion of correct and incorrect trajectories to further improve learning signal and robustness. Limitations of threshold-based hard rejection are being addressed by reward-informed objectives, curriculum learning, and multi-signal supervision frameworks.
