Sequence-level iw-SFT for Fine-Tuning

Updated 6 January 2026

Sequence-level iw-SFT is a method that integrates importance weighting with supervised fine-tuning to achieve a tighter lower bound on the reinforcement learning objective.
It leverages curated data and computed trajectory-level weights via importance sampling to boost performance in large language models and control domains.
The approach maintains standard maximum likelihood frameworks while employing clipping and smoothing to control variance and ensure theoretical guarantees.

Sequence-level importance-weighted supervised fine-tuning (iw-SFT) is an enhancement of the standard supervised fine-tuning (SFT) paradigm used predominantly for LLMs and imitation learning in control domains. Sequence-level iw-SFT frames SFT within the context of reinforcement learning (RL), explicitly optimizing a tighter lower bound on the RL objective via importance sampling. This approach leverages curated or filtered data together with importance weights computed at the trajectory (sequence) level, yielding empirical improvements and theoretical guarantees compared to standard SFT. The method is typically implemented with minimal changes to standard maximum likelihood objectives and remains competitive with more advanced RL algorithms (Qin et al., 17 Jul 2025).

1. SFT as a Lower Bound on the RL Objective

Let a trajectory be denoted $\tau = (s_0, a_0, \ldots, s_T)$, generated by a parameterized policy $\pi(\cdot|\cdot;\theta)$. In RL, the objective is to maximize the expected return:

$J(\theta) = \mathbb{E}_{\tau \sim \pi(\cdot; \theta)}\left[R(\tau)\right], \quad R(\tau) = \sum_{t=0}^T r(s_t, a_t).$

In a sparse-reward setting with only a terminal binary reward $r(T) \in \{0,1\}$, the return is $R(\tau) = \mathbb{I}\{\text{success}(\tau)\}$. Given a reference policy $\pi_\text{ref}$ and data filtered to include only successful trajectories, $D^+ = \{\tau_i: \text{success}(\tau_i) = 1\}$, importance sampling and the inequality $x \geq 1+\ln x$ yield:

$J(\theta) \geq \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[R(\tau) \ln \pi(\tau; \theta)\right] + \text{const.}$

For binary rewards, this reduces to a standard SFT objective:

$\mathcal{J}_\mathrm{SFT}(\theta) = \frac{1}{|D^+|} \sum_{\tau \in D^+} \ln \pi(\tau; \theta) \longleftrightarrow \max \mathcal{J}_\mathrm{SFT}(\theta),$

demonstrating that SFT maximizes a generally loose lower bound on the RL objective.
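
The step from the RL objective to this bound can be written out explicitly. The lines below are a sketch that assumes $R(\tau) \geq 0$ and that $\pi_\text{ref}$ assigns nonzero probability wherever $\pi(\cdot;\theta)$ does; neither assumption is stated explicitly above.

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[\frac{\pi(\tau;\theta)}{\pi_\text{ref}(\tau)} R(\tau)\right] \geq \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[\left(1 + \ln \frac{\pi(\tau;\theta)}{\pi_\text{ref}(\tau)}\right) R(\tau)\right]$

$= \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[R(\tau) \ln \pi(\tau;\theta)\right] + \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[R(\tau)\left(1 - \ln \pi_\text{ref}(\tau)\right)\right].$

The second expectation does not depend on $\theta$ and plays the role of the constant. The inequality holds with equality only where the ratio $\pi(\tau;\theta)/\pi_\text{ref}(\tau)$ equals 1, which is why the bound loosens as the policy moves away from the reference.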

2. Derivation of the Sequence-level iw-SFT Objective

Sequence-level iw-SFT seeks to tighten the lower bound by introducing an auxiliary distribution $q(\tau)$, ideally taken close to the target policy $\pi(\cdot; \theta)$. By rewriting the RL objective:

$J(\theta) = \int \pi_\text{ref}(\tau) \frac{q(\tau)}{\pi_\text{ref}(\tau)} \frac{\pi(\tau; \theta)}{q(\tau)} R(\tau) d\tau,$

and applying $x \geq 1+\ln x$ to $x = \pi(\tau; \theta) / q(\tau)$, the bound becomes

$J(\theta) \geq \int \pi_\text{ref}(\tau) \frac{q(\tau)}{\pi_\text{ref}(\tau)} R(\tau) \ln \pi(\tau; \theta) d\tau + \text{const.}$
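
The bound can be checked numerically on a toy problem. The sketch below assumes a finite space of six trajectories with randomly drawn distributions and a hand-picked binary reward (all illustrative, not taken from the paper), and it keeps the additive constant so that the two sides are directly comparable.

```python
import numpy as np

def rl_objective(pi, reward):
    """J = sum_tau pi(tau) R(tau) on a finite trajectory space."""
    return float(np.sum(pi * reward))

def lower_bound(pi, q, reward):
    """sum_tau q(tau) R(tau) [1 + ln pi(tau) - ln q(tau)], constant included.

    pi_ref cancels inside the integral, so the bound depends only on q.
    """
    return float(np.sum(q * reward * (1.0 + np.log(pi) - np.log(q))))

rng = np.random.default_rng(0)
n_traj = 6                                   # toy trajectory space (assumption)
pi_ref = rng.dirichlet(np.ones(n_traj))      # reference policy over trajectories
pi = rng.dirichlet(np.ones(n_traj))          # current policy over trajectories
reward = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # binary success indicator

J = rl_objective(pi, reward)
bound_sft = lower_bound(pi, pi_ref, reward)  # q = pi_ref: the SFT bound
bound_tight = lower_bound(pi, pi, reward)    # q = pi: bound meets J
print(J, bound_sft, bound_tight)
```

Choosing `q = pi_ref` gives a value no larger than `J`, while `q = pi` reproduces `J` up to floating-point error.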

In the sparse reward case, the sequence-level iw-SFT objective is

$\mathcal{J}_\mathrm{iw\text{-}SFT}(\theta) = \mathbb{E}_{\tau \sim D^+}\left[w(\tau) \ln \pi(\tau; \theta)\right], \quad w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)}.$

Empirically, this expectation is computed using a mini-batch of size $B$:

$\mathcal{L}_\mathrm{iw\text{-}SFT}(\theta) = -\frac{1}{B} \sum_{j=1}^B w(\tau^{(j)}) \ln \pi(\tau^{(j)}; \theta).$
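
As a minimal sketch of this mini-batch loss, the function below takes precomputed sequence log-likelihoods and weights as plain arrays. The name `iw_sft_loss` and the example numbers are illustrative assumptions; in a real training loop the log-likelihoods would come from the model and carry gradients, while the weights would be treated as constants.

```python
import numpy as np

def iw_sft_loss(seq_logp, weights):
    """Weighted negative log-likelihood: -(1/B) * sum_j w_j * ln pi(tau_j; theta)."""
    seq_logp = np.asarray(seq_logp, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-np.mean(weights * seq_logp))

# Three sequences with likelihoods 1/2, 1/4, 1/8 (illustrative values).
seq_logp = np.log([0.5, 0.25, 0.125])

loss_sft = iw_sft_loss(seq_logp, [1.0, 1.0, 1.0])   # all weights 1: plain SFT
loss_iw = iw_sft_loss(seq_logp, [2.0, 1.0, 0.5])    # sequence-level weights
print(loss_sft, loss_iw)
```

With all weights equal to 1 the function reduces to the standard SFT negative log-likelihood.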

For ordinal quality data scored into $C+1$ bins, one replaces $\mathbb{I}\{\tau \in D^+\}$ by $\sum_{i=0}^C \mathbb{I}\{S(\tau) > c_i\}$, where $S(\tau)$ denotes the quality score and the $c_i$ are the bin thresholds, yielding the generalized iw-SFT(Q) variant with the same structure.
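
The indicator sum in the quality-scored variant can be sketched as follows. The function name `quality_levels`, the scores, and the thresholds are hypothetical. The sketch covers only the sum $\sum_i \mathbb{I}\{S(\tau) > c_i\}$, not how the paper combines it with the importance weights.

```python
import numpy as np

def quality_levels(scores, thresholds):
    """Count, per trajectory, how many thresholds c_i the score S(tau) exceeds."""
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcast to shape (num_trajectories, num_thresholds), then sum indicators.
    return (scores[:, None] > thresholds[None, :]).sum(axis=1)

scores = [0.1, 0.5, 0.9]        # hypothetical quality scores S(tau)
thresholds = [0.0, 0.4, 0.8]    # hypothetical thresholds c_0, c_1, c_2
levels = quality_levels(scores, thresholds)
print(levels)
```

Higher-scoring trajectories exceed more thresholds and therefore receive a larger multiplier in the objective.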

3. Construction and Computation of Importance Weights

The key novelty is the use of full-trajectory importance weights:

$w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)} = \exp\left[\ln q(\tau) - \ln \pi_\text{ref}(\tau)\right].$

With $q(\tau)$ defined as $p(s_0) \prod_t p(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t; \theta_q)$, where $\theta_q$ is a delayed or exponential moving average (EMA) copy of $\theta$, the logarithm of the sequence weight decomposes as:

$\ln w(\tau) = \sum_{t=0}^{T-1} \left[\ln \pi(a_t|s_t; \theta_q) - \ln \pi_\text{ref}(a_t|s_t)\right].$
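
Because the dynamics terms cancel, the log-weight needs only per-step action log-probabilities under the two policies. The sketch below assumes these are already available as padded arrays of shape (batch, steps) with a 0/1 mask marking valid steps; the function name and the numbers are illustrative.

```python
import numpy as np

def sequence_log_weights(logp_q, logp_ref, mask):
    """ln w(tau) = sum_t [ln pi(a_t|s_t; theta_q) - ln pi_ref(a_t|s_t)] over valid steps."""
    return ((logp_q - logp_ref) * mask).sum(axis=1)

# Two sequences of two steps each (illustrative per-step probabilities).
logp_q = np.log(np.array([[0.5, 0.5], [0.2, 0.4]]))
logp_ref = np.log(np.array([[0.25, 0.5], [0.4, 0.4]]))
mask = np.ones_like(logp_q)  # no padding in this toy batch

weights = np.exp(sequence_log_weights(logp_q, logp_ref, mask))
print(weights)
```

Summing in log space and exponentiating once at the end avoids underflow from multiplying many small per-step ratios.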

4. Variance Control: Normalization, Clipping, and Smoothing

Unclipped importance weights $w(\tau)$ can exhibit high variance. Practical variance control includes:

Per-step clipping: Each token-level ratio is clipped, $\alpha_t = \mathrm{clip}\left(\frac{\pi(a_t|s_t; \theta_q)}{\pi_\text{ref}(a_t|s_t)}, \alpha_\text{min}, \alpha_\text{max}\right)$, then $w(\tau) = \prod_t \alpha_t$, and the final sequence weight is clipped to $[\beta_\text{min}, \beta_\text{max}]$.
Smoothing function $g(\cdot)$ : Log-weights can be smoothed as

$\ln w(\tau) = \sum_{t=0}^{T-1} g\left(\ln \pi(a_t|s_t; \theta_q) - \ln \pi_\text{ref}(a_t|s_t)\right)$

where $g(x) = kx$ for $k \in [0, 1]$ acts as a temperature interpolating between no weighting ($k \to 0$) and full weighting ($k \to 1$).
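
Both variance controls can be sketched on precomputed per-step log-ratios. The function names, clipping bounds, and example ratios below are illustrative assumptions rather than values from the paper. Clipping is done in log space, which is equivalent to clipping the ratios themselves because the logarithm is monotone.

```python
import numpy as np

def clipped_sequence_weight(log_ratios, mask, alpha_min, alpha_max, beta_min, beta_max):
    """Clip each per-step ratio to [alpha_min, alpha_max], multiply, then clip the product."""
    clipped = np.clip(log_ratios, np.log(alpha_min), np.log(alpha_max))
    log_w = (clipped * mask).sum(axis=-1)  # masked (padding) steps contribute a factor of 1
    return np.clip(np.exp(log_w), beta_min, beta_max)

def smoothed_sequence_weight(log_ratios, mask, k):
    """Apply g(x) = k * x to every per-step log-ratio before summing."""
    return np.exp((k * log_ratios * mask).sum(axis=-1))

# Per-step ratios for two sequences; the last step of the second one is padding.
log_ratios = np.log(np.array([[4.0, 0.5, 1.0], [4.0, 1.0, 8.0]]))
mask = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])

clipped_w = clipped_sequence_weight(log_ratios, mask, 0.5, 2.0, 0.2, 5.0)
no_weighting = smoothed_sequence_weight(log_ratios, mask, k=0.0)
half_weighting = smoothed_sequence_weight(log_ratios, mask, k=0.5)
full_weighting = smoothed_sequence_weight(log_ratios, mask, k=1.0)
print(clipped_w, no_weighting, half_weighting, full_weighting)
```

Setting `k=0.0` gives every sequence a weight of 1, which recovers plain SFT, while `k=1.0` reproduces the unsmoothed product of ratios.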

5. Implementation Procedure

The algorithm requires: an initial parameter vector $\theta = \theta_\text{ref}$ (reference policy), an EMA copy $\theta_q = \theta_\text{ref}$ , and a filtered dataset $D^+$ (or $D^+_Q$ for quality-scored data). Key steps:

initialize θ = θ_ref     # start from reference (e.g. pre-trained) weights
initialize θ_q = θ_ref   # auxiliary policy
for iteration = 1…N:
  sample a batch {τ_j} from filtered dataset D⁺ (or D_Q⁺)
  for each τ_j in batch:
    compute per-step log-ratios ρ_{j,t} = log π(a_t|s_t;θ_q) - log π_ref(a_t|s_t)
    compute sequence weight:
      ln w_j = ∑_t g(ρ_{j,t})
      w_j = clip( exp(ln w_j), w_min, w_max )
  compute weighted NLL loss:
    L = -(1/B) ∑_{j=1}^B w_j · log π(τ_j;θ)
  θ ← θ - η·∇_θ L       # e.g. AdamW update
  every K steps: update θ_q ← (1-α)·θ_q + α·θ   # EMA of θ
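
The loop above can be made concrete on a toy problem. The sketch below is not the authors' implementation. It assumes a context-free categorical policy `softmax(theta)` over a three-token vocabulary, for which the gradient of the sequence log-likelihood is available in closed form, and it uses plain gradient descent in place of AdamW. The dataset, hyperparameters, and function names are all illustrative.

```python
import numpy as np

def log_softmax(theta):
    z = theta - theta.max()
    return z - np.log(np.exp(z).sum())

def train_iw_sft(data, vocab, theta_ref, iters=300, batch=2, lr=0.05, k=1.0,
                 w_min=0.1, w_max=5.0, ema_every=5, ema_alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta_ref.copy()      # trained policy, starts at the reference
    theta_q = theta_ref.copy()    # auxiliary (EMA) policy
    logp_ref = log_softmax(theta_ref)
    steps = data.shape[1]
    for it in range(1, iters + 1):
        tau = data[rng.choice(len(data), size=batch, replace=False)]  # (B, T)
        # Per-step log-ratios between the auxiliary and reference policies.
        rho = log_softmax(theta_q)[tau] - logp_ref[tau]
        # Smoothed, clipped sequence weights; constants w.r.t. theta.
        w = np.clip(np.exp((k * rho).sum(axis=1)), w_min, w_max)
        # Closed form for this toy policy: grad ln pi(tau) = counts - T * softmax(theta).
        counts = np.stack([np.bincount(t, minlength=vocab) for t in tau])
        p = np.exp(log_softmax(theta))
        grad = -(w[:, None] * (counts - steps * p)).mean(axis=0)
        theta = theta - lr * grad                 # gradient step on the weighted NLL
        if it % ema_every == 0:                   # periodic EMA update of theta_q
            theta_q = (1.0 - ema_alpha) * theta_q + ema_alpha * theta
    return theta

# Toy "successful" sequences: token 0 fills three of the four steps in each.
data = np.array([[0, 0, 0, 1], [0, 0, 2, 0], [0, 1, 0, 0], [0, 0, 0, 2]])
theta = train_iw_sft(data, vocab=3, theta_ref=np.zeros(3))
probs = np.exp(log_softmax(theta))
print(probs)
```

Starting from a uniform reference policy, the trained policy shifts most of its probability mass onto token 0, the token that dominates the filtered data.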

Data requirements are a filtered set of successful or scored trajectories and a reference policy capable of evaluating $\pi_\text{ref}(a_t|s_t)$ (Qin et al., 17 Jul 2025).

6. Theoretical Properties

Sequence-level iw-SFT optimizes a tighter lower bound on the RL objective than standard SFT. The bound becomes tight as the auxiliary distribution $q(\tau) \to \pi(\tau; \theta)$, since $x \geq 1+\ln x$ holds with equality at $x = 1$. Standard SFT corresponds to the fixed choice $q = \pi_\text{ref}$ (all weights equal to 1), so its bound loosens as $\pi(\cdot; \theta)$ moves away from $\pi_\text{ref}$. A notable property is that iw-SFT reduces bias: it can, in principle, recover information from low-reward regions by down-weighting them, while vanilla SFT ignores them. No bias is introduced provided that the lower bound is maintained; in practice, however, stable training requires controlling the variance of the importance weights, which motivates clipping and smoothing.

7. Empirical Performance Summary

Empirical evaluations demonstrate consistent improvements for iw-SFT over conventional SFT:

Domain	SFT (%)	iw-SFT (%)	Notes
AIME 2024 (LLMs)	56.7	66.7	Qwen2.5-32B, 1K math traces
MATH 500 (LLMs)	94.4	94.8
GPQA (LLMs)	60.6	64.1
D4RL HalfCheetah-Med (RL)	39.3	40.9	SFT(Q) vs. iw-SFT(Q), competitive with AWAC (40.5)
Franka Kitchen fine-tuning	58.2	61.8	SFT(Q) vs. iw-SFT(Q), top 5% fine-tuning

Sequence-level weighting yields stable convergence without test-time "budget forcing".
On partially observed continuous control tasks (D4RL, Franka Kitchen), iw-SFT(Q) meets or surpasses advanced RL baselines such as AWAC and CQL.
Minor changes in implementation are sufficient for consistent gains (Qin et al., 17 Jul 2025).

References (1)

Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) (2025)
