
Sequence-level iw-SFT for Fine-Tuning

Updated 6 January 2026
  • Sequence-level iw-SFT is a method that integrates importance weighting with supervised fine-tuning to achieve a tighter lower bound on the reinforcement learning objective.
  • It leverages curated data and computed trajectory-level weights via importance sampling to boost performance in large language models and control domains.
  • The approach maintains standard maximum likelihood frameworks while employing clipping and smoothing to control variance and ensure theoretical guarantees.

Sequence-level importance-weighted supervised fine-tuning (iw-SFT) is an enhancement of the standard supervised fine-tuning (SFT) paradigm used predominantly for LLMs and imitation learning in control domains. Sequence-level iw-SFT frames SFT within the context of reinforcement learning (RL), explicitly optimizing a tighter lower bound on the RL objective via importance sampling. This approach leverages curated or filtered data together with importance weights computed at the trajectory (sequence) level, yielding empirical improvements and theoretical guarantees compared to standard SFT. The method is typically implemented with minimal changes to standard maximum likelihood objectives and remains competitive with more advanced RL algorithms (Qin et al., 17 Jul 2025).

1. SFT as a Lower Bound on the RL Objective

Let a trajectory be denoted $\tau = (s_0, a_0, \ldots, s_T)$, generated by a parameterized policy $\pi(\cdot \mid \cdot; \theta)$. In RL, the objective is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi(\cdot; \theta)}\left[R(\tau)\right], \quad R(\tau) = \sum_{t=0}^{T} r(s_t, a_t).$$

In a sparse reward setting with only a terminal binary reward $r(T) \in \{0,1\}$, $R(\tau) = \mathbb{I}\{\text{success}(\tau)\}$. Given a reference policy $\pi_\text{ref}$ and data filtered to include only successful trajectories $D^+ = \{\tau_i: \text{success}(\tau_i) = 1\}$, importance sampling and the inequality $x \geq 1 + \ln x$ yield:

$$J(\theta) \geq \mathbb{E}_{\tau \sim \pi_\text{ref}}\left[R(\tau) \ln \pi(\tau; \theta)\right] + \text{const.}$$

For binary rewards, this reduces to a standard SFT objective:

$$\mathcal{J}_\mathrm{SFT}(\theta) = \frac{1}{|D^+|} \sum_{\tau \in D^+} \ln \pi(\tau; \theta) \longleftrightarrow \max \mathcal{J}_\mathrm{SFT}(\theta),$$

demonstrating that SFT maximizes a generally loose lower bound on the RL objective.
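The bound can be checked numerically on a small discrete example. The sketch below assumes a hypothetical setting with four possible trajectories, a binary reward, and two categorical policies; all numbers are illustrative and not taken from the source. The constant is written out as $\mathbb{E}_{\tau \sim \pi_\text{ref}}[R(\tau)(1 - \ln \pi_\text{ref}(\tau))]$, which follows from applying $x \geq 1 + \ln x$ to $x = \pi(\tau;\theta)/\pi_\text{ref}(\tau)$.

```python
import math

# Illustrative assumptions: four trajectories, binary terminal reward,
# a uniform reference policy and a slightly improved current policy.
reward = [1.0, 1.0, 0.0, 0.0]
pi_ref = [0.25, 0.25, 0.25, 0.25]
pi_theta = [0.40, 0.30, 0.20, 0.10]


def expected_return(policy, reward):
    """J(theta) = sum_tau pi(tau; theta) * R(tau)."""
    return sum(p * r for p, r in zip(policy, reward))


def sft_lower_bound(policy, ref, reward):
    """E_ref[R ln pi] + const, with const = E_ref[R (1 - ln pi_ref)]."""
    sft_term = sum(q * r * math.log(p) for p, q, r in zip(policy, ref, reward))
    const = sum(q * r * (1.0 - math.log(q)) for q, r in zip(ref, reward))
    return sft_term + const


j_true = expected_return(pi_theta, reward)
j_bound = sft_lower_bound(pi_theta, pi_ref, reward)
print(f"J(theta) = {j_true:.4f}, SFT lower bound = {j_bound:.4f}")
```

In this toy case the bound sits below the true return, and it coincides with the return when the policy equals the reference policy.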

2. Derivation of the Sequence-level iw-SFT Objective

Sequence-level iw-SFT seeks to tighten the lower bound by introducing an auxiliary distribution $q(\tau)$, ideally taken close to the target policy $\pi(\cdot; \theta)$. By rewriting the RL objective:

$$J(\theta) = \int \pi_\text{ref}(\tau) \frac{q(\tau)}{\pi_\text{ref}(\tau)} \frac{\pi(\tau; \theta)}{q(\tau)} R(\tau) \, d\tau,$$

and applying $x \geq 1 + \ln x$ to $x = \pi(\tau; \theta) / q(\tau)$, the bound becomes

$$J(\theta) \geq \int \pi_\text{ref}(\tau) \frac{q(\tau)}{\pi_\text{ref}(\tau)} R(\tau) \ln \pi(\tau; \theta) \, d\tau + \text{const.}$$
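For reference, the constant in this bound can be written out explicitly. The worked step below is a sketch that assumes $R(\tau) \geq 0$ (so that multiplying the inequality by $R$ preserves its direction) and that $q$ is held fixed with respect to $\theta$:

```latex
\begin{aligned}
J(\theta)
  &= \int q(\tau)\, \frac{\pi(\tau;\theta)}{q(\tau)}\, R(\tau)\, d\tau \\
  &\geq \int q(\tau)\, R(\tau) \left[ 1 + \ln \pi(\tau;\theta) - \ln q(\tau) \right] d\tau \\
  &= \int q(\tau)\, R(\tau) \ln \pi(\tau;\theta)\, d\tau
     + \underbrace{\int q(\tau)\, R(\tau) \left[ 1 - \ln q(\tau) \right] d\tau}_{\text{constant in } \theta}
\end{aligned}
```

Since $\pi_\text{ref}(\tau)\, q(\tau) / \pi_\text{ref}(\tau) = q(\tau)$, the first term is the integral appearing in the bound, and the inequality holds with equality wherever $\pi(\tau;\theta) = q(\tau)$.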

In the sparse reward case, the sequence-level iw-SFT objective is

$$\mathcal{J}_\mathrm{iw\text{-}SFT}(\theta) = \mathbb{E}_{\tau \sim D^+}\left[w(\tau) \ln \pi(\tau; \theta)\right], \quad w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)}.$$

Empirically, this expectation is computed using a mini-batch of size $B$:

$$\mathcal{L}_\mathrm{iw\text{-}SFT}(\theta) = -\frac{1}{B} \sum_{j=1}^B w(\tau^{(j)}) \ln \pi(\tau^{(j)}; \theta).$$
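As a minimal sketch of the mini-batch loss, the function below takes per-trajectory log-likelihoods $\ln \pi(\tau^{(j)}; \theta)$ and precomputed weights $w(\tau^{(j)})$ as plain lists. The function name `iw_sft_loss` and the example numbers are assumptions made for illustration. In a real training setup the weights would be treated as constants, with gradients flowing only through the log-likelihood term.

```python
def iw_sft_loss(log_probs, weights):
    """Mini-batch iw-SFT loss: -(1/B) * sum_j w_j * ln pi(tau_j; theta)."""
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    batch_size = len(log_probs)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / batch_size


# Hypothetical batch of three trajectories.
log_probs = [-2.0, -1.0, -4.0]   # ln pi(tau_j; theta)
weights = [1.5, 1.0, 0.5]        # w(tau_j) = q(tau_j) / pi_ref(tau_j)

weighted_loss = iw_sft_loss(log_probs, weights)
plain_sft_loss = iw_sft_loss(log_probs, [1.0, 1.0, 1.0])
print(weighted_loss, plain_sft_loss)
```

Setting every weight to 1 gives the ordinary SFT negative log-likelihood, which shows how small the change to a standard maximum likelihood objective is.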

For ordinal quality data scored into $C+1$ bins, one replaces $\mathbb{I}\{\tau \in D^+\}$ by $\sum_{i=0}^{C} \mathbb{I}\{S(\tau) > c_i\}$, where $S(\tau)$ denotes the quality score of a trajectory and $c_i$ the bin thresholds, yielding the generalized iw-SFT(Q) variant with the same structure.
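The indicator sum can be illustrated with a short sketch that counts how many thresholds a score exceeds. The function name `quality_multiplier`, the thresholds, and the scores are hypothetical values chosen for illustration.

```python
def quality_multiplier(score, thresholds):
    """Return sum_i 1{score > c_i}: the number of thresholds strictly exceeded."""
    return sum(1 for c in thresholds if score > c)


# Hypothetical thresholds c_0 < c_1 < c_2 and three trajectory scores S(tau).
thresholds = [0.0, 0.5, 0.8]
scores = [0.9, 0.6, 0.0]

multipliers = [quality_multiplier(s, thresholds) for s in scores]
print(multipliers)
```

A trajectory that clears more thresholds receives a larger multiplier, and one that clears none contributes nothing, which mirrors the role of the success indicator in the binary case.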

3. Construction and Computation of Importance Weights

The key novelty is the use of full-trajectory importance weights:

$$w(\tau) = \frac{q(\tau)}{\pi_\text{ref}(\tau)} = \exp\left[\ln q(\tau) - \ln \pi_\text{ref}(\tau)\right].$$

With $q(\tau)$ defined as $p(s_0) \prod_t p(s_{t+1} \mid s_t, a_t)\,\pi(a_t \mid s_t; \theta_q)$, where $\theta_q$ is a delayed or exponential moving average (EMA) copy of $\theta$, the initial-state and transition terms cancel against those of $\pi_\text{ref}(\tau)$, and the logarithm of the sequence weight decomposes as:

$$\ln w(\tau) = \sum_{t=0}^{T-1} \left[\ln \pi(a_t \mid s_t; \theta_q) - \ln \pi_\text{ref}(a_t \mid s_t)\right].$$
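A minimal sketch of this decomposition follows, assuming the per-step log-probabilities under $\theta_q$ and under the reference policy are already available as lists. The function name and the probabilities are illustrative assumptions.

```python
import math


def sequence_log_weight(logp_q, logp_ref):
    """ln w(tau) = sum_t [ln pi(a_t|s_t; theta_q) - ln pi_ref(a_t|s_t)]."""
    if len(logp_q) != len(logp_ref):
        raise ValueError("both sequences must cover the same steps")
    return sum(lq - lr for lq, lr in zip(logp_q, logp_ref))


# Hypothetical two-step trajectory: per-step action probabilities.
logp_q = [math.log(0.5), math.log(0.4)]      # under theta_q
logp_ref = [math.log(0.25), math.log(0.5)]   # under the reference policy

log_w = sequence_log_weight(logp_q, logp_ref)
w = math.exp(log_w)   # equals (0.5 / 0.25) * (0.4 / 0.5)
print(w)
```

Summing in log space and exponentiating once at the end avoids multiplying many small ratios directly, which matters for long sequences.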

4. Variance Control: Normalization, Clipping, and Smoothing

Unclipped importance weights $w(\tau)$ can exhibit high variance. Practical variance control includes:

  • Per-step clipping: Each token-level ratio is clipped: $\alpha_t = \mathrm{clip}\left(\frac{\pi(a_t \mid s_t; \theta_q)}{\pi_\text{ref}(a_t \mid s_t)}, \alpha_\text{min}, \alpha_\text{max}\right)$, then $w(\tau) = \prod_t \alpha_t$ and final sequence weights are clipped to $[\beta_\text{min}, \beta_\text{max}]$.
  • Smoothing function $g(\cdot)$: Log-weights can be smoothed as

$$\ln w(\tau) = \sum_{t=0}^{T-1} g\left(\ln \pi(a_t \mid s_t; \theta_q) - \ln \pi_\text{ref}(a_t \mid s_t)\right)$$

where $g(x) = kx$ for $k \in [0, 1]$ acts as a temperature interpolating between no weighting ($k \to 0$) and full weighting ($k \to 1$).
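Both variance controls can be sketched in a few lines. The helper names, the clipping ranges, and the per-step ratios below are assumptions for illustration, not values from the source.

```python
import math


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_sequence_weight(logp_q, logp_ref, alpha_min, alpha_max, beta_min, beta_max):
    """Clip each per-step ratio, multiply, then clip the sequence weight."""
    w = 1.0
    for lq, lr in zip(logp_q, logp_ref):
        w *= clip(math.exp(lq - lr), alpha_min, alpha_max)
    return clip(w, beta_min, beta_max)


def smoothed_sequence_weight(logp_q, logp_ref, k):
    """Sequence weight with linear smoothing g(x) = k * x on each log-ratio."""
    return math.exp(sum(k * (lq - lr) for lq, lr in zip(logp_q, logp_ref)))


# Hypothetical per-step ratios 2.0, 0.8 and 4.0 (reference log-probs set to 0).
logp_q = [math.log(2.0), math.log(0.8), math.log(4.0)]
logp_ref = [0.0, 0.0, 0.0]

w_clipped = clipped_sequence_weight(logp_q, logp_ref, 0.5, 2.0, 0.1, 3.0)
w_none = smoothed_sequence_weight(logp_q, logp_ref, k=0.0)
w_half = smoothed_sequence_weight(logp_q, logp_ref, k=0.5)
w_full = smoothed_sequence_weight(logp_q, logp_ref, k=1.0)
print(w_clipped, w_none, w_half, w_full)
```

With these numbers the step ratio 4.0 is first clipped to 2.0, the product 3.2 is then clipped to the sequence-level maximum 3.0, and the smoothed weight moves from 1 (no weighting) to the full product 6.4 as $k$ goes from 0 to 1.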

5. Implementation Procedure

The algorithm requires: an initial parameter vector $\theta = \theta_\text{ref}$ (reference policy), an EMA copy $\theta_q = \theta_\text{ref}$, and a filtered dataset $D^+$ (or $D^+_Q$ for quality-scored data). Key steps:

initialize θ = θ_ref     # start from reference (e.g. pre-trained) weights
initialize θ_q = θ_ref   # auxiliary policy
for iteration = 1..N:
  sample a batch {τ_j} from filtered dataset D+ (or D+_Q)
  for each τ_j in batch:
    compute per-step log-ratios ρ_{j,t} = log π(a_t|s_t;θ_q) - log π_ref(a_t|s_t)
    compute sequence weight:
      ln w_j = Σ_t g(ρ_{j,t})
      w_j = clip( exp(ln w_j), w_min, w_max )
  compute weighted NLL loss:
    L = -(1/B) Σ_{j=1}^B w_j · log π(τ_j;θ)
  θ ← θ - η·∇_θ L       # e.g. AdamW update
  every K steps: update θ_q ← (1-α)·θ_q + α·θ   # EMA of θ

Data requirements are a filtered set of successful or scored trajectories and a reference policy capable of evaluating $\pi_\text{ref}(a_t \mid s_t)$ (Qin et al., 17 Jul 2025).
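The procedure above can be exercised end to end on a toy problem. The sketch below is an illustration under strong simplifying assumptions, not the authors' implementation: each "trajectory" is a single discrete action, the policy is a softmax over four logits, plain gradient descent replaces AdamW, and every hyperparameter and the dataset are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_iw_sft(theta_ref, dataset, steps=300, batch_size=4, lr=0.1,
                 ema_rate=0.1, ema_every=5, k=1.0, w_min=0.5, w_max=2.0, seed=0):
    """Toy iw-SFT loop on one-step trajectories (a trajectory is one action index)."""
    rng = random.Random(seed)
    pi_ref = softmax(theta_ref)
    theta = list(theta_ref)      # trained parameters, start at the reference
    theta_q = list(theta_ref)    # EMA copy that defines the auxiliary policy q
    for step in range(1, steps + 1):
        batch = rng.choices(dataset, k=batch_size)
        pi_q = softmax(theta_q)
        # Smoothed and clipped importance weights; constants w.r.t. theta.
        weights = []
        for a in batch:
            log_ratio = math.log(pi_q[a]) - math.log(pi_ref[a])
            weights.append(min(w_max, max(w_min, math.exp(k * log_ratio))))
        # Gradient of L = -(1/B) sum_j w_j ln pi(a_j; theta) for a softmax policy:
        # d ln pi(a) / d theta_i = 1{i == a} - pi_i.
        pi = softmax(theta)
        grad = [0.0] * len(theta)
        for a, w in zip(batch, weights):
            for i in range(len(theta)):
                indicator = 1.0 if i == a else 0.0
                grad[i] -= w * (indicator - pi[i]) / batch_size
        theta = [t - lr * g for t, g in zip(theta, grad)]
        if step % ema_every == 0:
            theta_q = [(1.0 - ema_rate) * tq + ema_rate * t
                       for tq, t in zip(theta_q, theta)]
    return theta


# Hypothetical filtered dataset: only the successful actions 0 and 1 appear.
theta_ref = [0.0, 0.0, 0.0, 0.0]
dataset = [0, 0, 0, 1, 0, 1, 0, 0]

theta = train_iw_sft(theta_ref, dataset)
success_before = sum(softmax(theta_ref)[:2])
success_after = sum(softmax(theta)[:2])
print(f"P(success) before: {success_before:.3f}, after: {success_after:.3f}")
```

The only differences from a plain SFT loop are the weight computation and the periodic EMA update of `theta_q`.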

6. Theoretical Properties

Sequence-level iw-SFT provides a tighter lower bound on the RL objective than standard SFT. As the auxiliary distribution $q(\tau) \to \pi(\tau; \theta)$, equality is achieved. Standard SFT corresponds to the fixed choice $q = \pi_\text{ref}$, for which the bound is tight only where $\pi(\tau; \theta)$ matches $\pi_\text{ref}(\tau)$ on the rewarded trajectories and is loose otherwise. A notable property is that iw-SFT reduces bias: it can, in principle, recover information from low-reward regions by down-weighting them, while vanilla SFT ignores them. No bias is introduced provided that the lower bound is maintained; however, practical stability requires controlling the variance of the importance weights, which justifies the use of clipping and smoothing.
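The effect of the auxiliary distribution on tightness can be seen in a small numerical sketch. It assumes four discrete trajectories with a binary reward and evaluates the bound, with its constant written out as $\sum_\tau q(\tau) R(\tau)\,[1 + \ln \pi(\tau;\theta) - \ln q(\tau)]$, for three choices of $q$. The distributions and the mixing helper are hypothetical.

```python
import math

# Illustrative assumptions: four trajectories with a binary reward.
reward = [1.0, 1.0, 0.0, 0.0]
pi_ref = [0.25, 0.25, 0.25, 0.25]
pi_theta = [0.40, 0.30, 0.20, 0.10]


def lower_bound(policy, q, reward):
    """sum_tau q(tau) R(tau) [1 + ln pi(tau; theta) - ln q(tau)]."""
    return sum(qi * r * (1.0 + math.log(p) - math.log(qi))
               for p, qi, r in zip(policy, q, reward))


def mix(a, b, lam):
    """Pointwise interpolation (1 - lam) * a + lam * b between two distributions."""
    return [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]


j_true = sum(p * r for p, r in zip(pi_theta, reward))
bound_sft = lower_bound(pi_theta, pi_ref, reward)                        # q = pi_ref
bound_half = lower_bound(pi_theta, mix(pi_ref, pi_theta, 0.5), reward)   # q halfway
bound_tight = lower_bound(pi_theta, pi_theta, reward)                    # q = pi_theta
print(f"J = {j_true:.4f}")
print(f"q = pi_ref:   {bound_sft:.4f}")
print(f"q halfway:    {bound_half:.4f}")
print(f"q = pi_theta: {bound_tight:.4f}")
```

In this example the bound rises toward the true return as $q$ moves from the reference policy toward the current policy, and the two coincide when $q = \pi(\cdot; \theta)$.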

7. Empirical Performance Summary

Empirical evaluations demonstrate consistent improvements for iw-SFT over conventional SFT:

| Domain | SFT (%) | iw-SFT (%) | Notes |
|---|---|---|---|
| AIME 2024 (LLMs) | 56.7 | 66.7 | Qwen2.5-32B, 1K math traces |
| MATH 500 (LLMs) | 94.4 | 94.8 | |
| GPQA (LLMs) | 60.6 | 64.1 | |
| D4RL HalfCheetah-Med (RL) | 39.3 | 40.9 | SFT(Q) vs. iw-SFT(Q), competitive with AWAC (40.5) |
| Franka Kitchen fine-tuning | 58.2 | 61.8 | SFT(Q) vs. iw-SFT(Q), top 5% fine-tuning |

  • Sequence-level weighting yields stable convergence without test-time "budget forcing".
  • On partially observed continuous control tasks (D4RL, Franka Kitchen), iw-SFT(Q) meets or surpasses advanced RL baselines such as AWAC and CQL.
  • Minor changes in implementation are sufficient for consistent gains (Qin et al., 17 Jul 2025).
References (1)
