Sequence-level iw-SFT for Fine-Tuning
- Sequence-level iw-SFT is a method that integrates importance weighting with supervised fine-tuning to achieve a tighter lower bound on the reinforcement learning objective.
- It leverages curated data and computed trajectory-level weights via importance sampling to boost performance in large language models and control domains.
- The approach maintains standard maximum likelihood frameworks while employing clipping and smoothing to control variance and ensure theoretical guarantees.
Sequence-level importance-weighted supervised fine-tuning (iw-SFT) is an enhancement of the standard supervised fine-tuning (SFT) paradigm used predominantly for LLMs and imitation learning in control domains. Sequence-level iw-SFT frames SFT within the context of reinforcement learning (RL), explicitly optimizing a tighter lower bound on the RL objective via importance sampling. This approach leverages curated or filtered data together with importance weights computed at the trajectory (sequence) level, yielding empirical improvements and theoretical guarantees compared to standard SFT. The method is typically implemented with minimal changes to standard maximum likelihood objectives and remains competitive with more advanced RL algorithms (Qin et al., 17 Jul 2025).
1. SFT as a Lower Bound on the RL Objective
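Let a trajectory be denoted $\tau = (s_0, a_0, \ldots, s_T, a_T)$, generated by a parameterized policy $\pi(a_t \mid s_t; \theta)$. In RL, the objective is to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi(\cdot;\theta)}\left[R(\tau)\right].$$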
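In a sparse reward setting, only a terminal binary reward $R(\tau) \in \{0, 1\}$ is received. Given a reference policy $\pi_{\text{ref}}$ and data filtered to include only successful trajectories, $\mathcal{D}^+ = \{\tau : R(\tau) = 1\}$, importance sampling and the inequality $x \geq 1 + \log x$ yield:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[\frac{\pi(\tau;\theta)}{\pi_{\text{ref}}(\tau)}\, R(\tau)\right] \geq \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[R(\tau) \log \pi(\tau;\theta)\right] + \text{const}.$$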
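For binary rewards, this reduces, up to a positive scale factor and an additive constant, to a standard SFT objective:
$$J_{\text{SFT}}(\theta) = \mathbb{E}_{\tau \in \mathcal{D}^+}\left[\log \pi(\tau;\theta)\right],$$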
demonstrating that SFT maximizes a generally loose lower bound on the RL objective.
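The looseness of this bound can be checked numerically. The following is a minimal sketch on a hypothetical four-trajectory problem (the reward vector and both distributions are made up for illustration); it evaluates the true objective and the bound obtained from $x \geq 1 + \log x$, including the constant term.

```python
import math

# Hypothetical toy problem: four possible trajectories with binary rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
pi_ref = [0.25, 0.25, 0.25, 0.25]    # reference policy over trajectories
pi_theta = [0.40, 0.10, 0.30, 0.20]  # current policy over trajectories

def rl_objective(pi, rewards):
    """Expected return J = sum_tau pi(tau) * R(tau)."""
    return sum(p * r for p, r in zip(pi, rewards))

def sft_lower_bound(pi, pi_ref, rewards):
    """E_{pi_ref}[(1 + log(pi / pi_ref)) * R], which follows from x >= 1 + log x."""
    return sum(
        q * (1.0 + math.log(p / q)) * r
        for p, q, r in zip(pi, pi_ref, rewards)
    )

j_true = rl_objective(pi_theta, rewards)
j_bound = sft_lower_bound(pi_theta, pi_ref, rewards)
print(f"J = {j_true:.4f}, SFT bound = {j_bound:.4f}")
```

In this toy setting the bound sits strictly below the true objective, and the two coincide only when the policy equals the reference policy.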
2. Derivation of the Sequence-level iw-SFT Objective
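Sequence-level iw-SFT seeks to tighten the lower bound by introducing an auxiliary distribution $q(\tau)$, ideally taken close to the target policy $\pi(\tau;\theta)$. By rewriting the RL objective:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[\frac{q(\tau)}{\pi_{\text{ref}}(\tau)} \cdot \frac{\pi(\tau;\theta)}{q(\tau)}\, R(\tau)\right]$$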
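and applying $x \geq 1 + \log x$ to $x = \pi(\tau;\theta)/q(\tau)$, the bound becomes
$$J(\theta) \geq \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[\frac{q(\tau)}{\pi_{\text{ref}}(\tau)}\, R(\tau) \log \pi(\tau;\theta)\right] + \text{const}.$$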
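In the sparse reward case, the sequence-level iw-SFT objective is
$$J_{\text{iw-SFT}}(\theta) = \mathbb{E}_{\tau \in \mathcal{D}^+}\left[\frac{q(\tau)}{\pi_{\text{ref}}(\tau)} \log \pi(\tau;\theta)\right].$$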
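Empirically, this expectation is computed using a mini-batch of size $B$:
$$J_{\text{iw-SFT}}(\theta) \approx \frac{1}{B} \sum_{j=1}^{B} w(\tau_j) \log \pi(\tau_j;\theta), \qquad w(\tau_j) = \frac{q(\tau_j)}{\pi_{\text{ref}}(\tau_j)}.$$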
For ordinal quality data scored into bins, one replaces $\mathcal{D}^+$ by $\mathcal{D}_Q^+$, yielding the generalized iw-SFT(Q) variant with the same structure.
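The mini-batch estimate is a weighted average of sequence log-likelihoods. A minimal sketch follows; the function name `iw_sft_objective` and the numbers in the batch are hypothetical, and the weights are treated as fixed constants rather than differentiated through.

```python
def iw_sft_objective(seq_log_probs, weights):
    """Mini-batch estimate (1/B) * sum_j w_j * log pi(tau_j; theta)."""
    if len(seq_log_probs) != len(weights):
        raise ValueError("need exactly one weight per trajectory")
    batch_size = len(seq_log_probs)
    return sum(w * lp for w, lp in zip(weights, seq_log_probs)) / batch_size

# Hypothetical batch of B = 3 filtered trajectories.
seq_log_probs = [-2.0, -1.0, -4.0]  # log pi(tau_j; theta)
weights = [0.5, 1.5, 1.0]           # w(tau_j) = q(tau_j) / pi_ref(tau_j)

iw_value = iw_sft_objective(seq_log_probs, weights)
# Setting every weight to 1 recovers the plain SFT objective on the same batch.
sft_value = iw_sft_objective(seq_log_probs, [1.0] * len(seq_log_probs))
print(iw_value, sft_value)
```

Negating the returned value gives a weighted negative log-likelihood loss suitable for a standard minimizer.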
3. Construction and Computation of Importance Weights
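The key novelty is the use of full-trajectory importance weights:
$$w(\tau) = \frac{q(\tau)}{\pi_{\text{ref}}(\tau)} = \prod_{t} \frac{q(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}.$$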
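With $q$ defined as $\pi(\cdot;\theta_q)$, where $\theta_q$ is a delayed or exponential moving average (EMA) copy of $\theta$, the logarithm of the sequence weight decomposes as:
$$\ln w(\tau) = \sum_{t} \left[\log \pi(a_t \mid s_t; \theta_q) - \log \pi_{\text{ref}}(a_t \mid s_t)\right].$$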
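Because the weight is a product over steps, it is usually computed in log-space from per-token log-probabilities. The sketch below assumes those log-probabilities are already available as plain lists; the helper name `sequence_log_weight` and the example probabilities are hypothetical.

```python
import math

def sequence_log_weight(logp_q, logp_ref):
    """ln w(tau) = sum_t [log q(a_t | s_t) - log pi_ref(a_t | s_t)]."""
    if len(logp_q) != len(logp_ref):
        raise ValueError("per-token log-prob lists must have equal length")
    return sum(lq - lr for lq, lr in zip(logp_q, logp_ref))

# Hypothetical three-token trajectory.
logp_q = [math.log(0.5), math.log(0.8), math.log(0.9)]    # auxiliary policy
logp_ref = [math.log(0.4), math.log(0.5), math.log(0.9)]  # reference policy

log_w = sequence_log_weight(logp_q, logp_ref)
w = math.exp(log_w)  # equals (0.5 * 0.8 * 0.9) / (0.4 * 0.5 * 0.9)
print(log_w, w)
```

Summing log-ratios rather than multiplying raw ratios avoids underflow on long sequences, which matters when trajectories span hundreds or thousands of tokens.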
4. Variance Control: Normalization, Clipping, and Smoothing
Unclipped importance weights can exhibit high variance. Practical variance control includes:
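- Per-step clipping: Each token-level log-ratio $\rho_t = \log \pi(a_t \mid s_t; \theta_q) - \log \pi_{\text{ref}}(a_t \mid s_t)$ is clipped to a fixed interval, then $\ln w(\tau) = \sum_t \rho_t$ is formed from the clipped values, and final sequence weights are clipped to $[w_{\min}, w_{\max}]$.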
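- Smoothing function $g$: Log-weights can be smoothed as
$$\ln w(\tau) = \sum_t g(\rho_t),$$
where a scaling such as $g(\rho) = \lambda \rho$ for $\lambda \in [0, 1]$ acts as a temperature interpolating between no weighting ($\lambda = 0$) and full weighting ($\lambda = 1$).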
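The two controls above can be combined in a few lines. This is a sketch under stated assumptions: the smoothing is taken to be a linear scaling of each log-ratio (one possible choice of $g$), and the parameter names `step_lo`, `step_hi`, `w_min`, `w_max`, and `scale`, along with all numeric values, are hypothetical.

```python
import math

def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def controlled_weight(log_ratios, step_lo, step_hi, w_min, w_max, scale=1.0):
    """Clip per-step log-ratios, smooth them, sum, then clip the final weight."""
    clipped = [clip(r, step_lo, step_hi) for r in log_ratios]
    log_w = sum(scale * r for r in clipped)  # assumed smoothing: g(rho) = scale * rho
    return clip(math.exp(log_w), w_min, w_max)

log_ratios = [0.1, 3.0, -0.2]        # one outlier token dominates the raw weight
raw_w = math.exp(sum(log_ratios))    # about 18.2 without any control
w_full = controlled_weight(log_ratios, -1.0, 1.0, 0.1, 5.0, scale=1.0)
w_none = controlled_weight(log_ratios, -1.0, 1.0, 0.1, 5.0, scale=0.0)
w_capped = controlled_weight([1.0, 1.0, 1.0], -1.0, 1.0, 0.1, 5.0)
print(raw_w, w_full, w_none, w_capped)
```

With `scale=0.0` every weight collapses to 1, which corresponds to unweighted SFT, while per-step clipping keeps a single outlier token from dominating the sequence weight.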
5. Implementation Procedure
The algorithm requires: an initial parameter vector $\theta_{\text{ref}}$ (reference policy), an EMA copy $\theta_q$, and a filtered dataset $\mathcal{D}^+$ (or $\mathcal{D}_Q^+$ for quality-scored data). Key steps:
```
initialize θ = θ_ref                               # start from reference (e.g. pre-trained) weights
initialize θ_q = θ_ref                             # auxiliary policy
for iteration = 1…N:
    sample a batch {τ_j} from filtered dataset D⁺ (or D_Q⁺)
    for each τ_j in batch:
        compute per-step log-ratios ρ_{j,t} = log π(a_t|s_t;θ_q) - log π_ref(a_t|s_t)
        compute sequence weight:
            ln w_j = ∑_t g(ρ_{j,t})
            w_j = clip( exp(ln w_j), w_min, w_max )
    compute weighted NLL loss:
        L = -(1/B) ∑_{j=1}^B w_j · log π(τ_j;θ)
    θ ← θ - η·∇_θ L                                # e.g. AdamW update
    every K steps: update θ_q ← (1-α)·θ_q + α·θ    # EMA of θ
```
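To make the procedure concrete, here is a runnable sketch on a deliberately tiny model. It is not the authors' implementation: the policy is assumed to be a context-free softmax over a small token vocabulary, plain gradient descent stands in for AdamW, $g$ is taken to be the identity, and the function name, hyperparameters, and dataset are all hypothetical.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def train_iw_sft(dataset, theta_ref, n_iters=200, batch_size=4, lr=0.1,
                 ema_rate=0.1, ema_every=5, w_min=0.1, w_max=10.0, seed=0):
    rng = random.Random(seed)
    theta = list(theta_ref)    # trainable policy parameters
    theta_q = list(theta_ref)  # auxiliary (EMA) policy parameters
    logp_ref = log_softmax(theta_ref)
    n_tokens = len(theta)
    for it in range(1, n_iters + 1):
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        probs = [math.exp(lp) for lp in log_softmax(theta)]
        logp_q = log_softmax(theta_q)
        grad = [0.0] * n_tokens  # gradient of the weighted NLL loss
        for traj in batch:
            # Sequence weight from summed per-token log-ratios, then clipped.
            log_w = sum(logp_q[a] - logp_ref[a] for a in traj)
            w = clip(math.exp(log_w), w_min, w_max)
            # d/d theta_k of -log pi(tau; theta) = sum_t (softmax_k - 1[a_t == k]).
            for a in traj:
                for k in range(n_tokens):
                    indicator = 1.0 if k == a else 0.0
                    grad[k] += w * (probs[k] - indicator) / batch_size
        theta = [t - lr * g for t, g in zip(theta, grad)]
        if it % ema_every == 0:  # periodic EMA update of the auxiliary policy
            theta_q = [(1 - ema_rate) * q + ema_rate * t
                       for q, t in zip(theta_q, theta)]
    return theta, theta_q

# Hypothetical filtered dataset: "successful" sequences that mostly use token 0.
dataset = [[0, 0, 1], [0, 2, 0], [0, 0, 0], [1, 0, 0]]
theta, theta_q = train_iw_sft(dataset, theta_ref=[0.0, 0.0, 0.0])
probs = [math.exp(lp) for lp in log_softmax(theta)]
print(probs)
```

The weights depend only on `theta_q` and the reference policy, so they enter the gradient as constants; in an autodiff framework this corresponds to detaching the weight before multiplying it with the log-likelihood.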
6. Theoretical Properties
Sequence-level iw-SFT provides a tighter lower bound on the RL objective than standard SFT. As the auxiliary distribution $q \to \pi(\cdot;\theta)$, equality is achieved. Standard SFT, corresponding to constant weights $w(\tau) = 1$ (that is, $q = \pi_{\text{ref}}$), generally leaves a loose bound. A notable property is that iw-SFT reduces bias: it can, in principle, recover information from low-reward regions by down-weighting them, while vanilla SFT ignores them. No bias is introduced provided that the lower bound is maintained; however, practical stability requires controlling the variance of importance weights, justifying the use of clipping and smoothing.
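The tightening can be illustrated on a toy problem. The sketch below is an illustration under assumptions rather than a proof: the four-trajectory setup is hypothetical, and $q$ is moved from the reference policy toward the current policy along a simple mixture path controlled by a made-up parameter `beta`.

```python
import math

# Hypothetical toy problem: four trajectories with binary rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
pi_ref = [0.25, 0.25, 0.25, 0.25]
pi_theta = [0.40, 0.10, 0.30, 0.20]

def iw_lower_bound(pi, q, rewards):
    """sum_tau q(tau) * (1 + log(pi(tau) / q(tau))) * R(tau).

    This equals E_{pi_ref}[(q / pi_ref) * (1 + log(pi / q)) * R],
    because pi_ref cancels once the expectation is written as a sum.
    """
    return sum(qi * (1.0 + math.log(p / qi)) * r
               for p, qi, r in zip(pi, q, rewards))

j_true = sum(p * r for p, r in zip(pi_theta, rewards))

bounds = []
for beta in (0.0, 0.5, 1.0):  # beta = 0 gives q = pi_ref, beta = 1 gives q = pi_theta
    q = [(1 - beta) * a + beta * b for a, b in zip(pi_ref, pi_theta)]
    bounds.append(iw_lower_bound(pi_theta, q, rewards))

print(j_true, bounds)
```

At `beta = 0` the value is the standard SFT bound, and at `beta = 1` it matches the true objective, with the intermediate value lying in between.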
7. Empirical Performance Summary
Empirical evaluations demonstrate consistent improvements for iw-SFT over conventional SFT:
| Domain | SFT (%) | iw-SFT (%) | Notes |
|---|---|---|---|
| AIME 2024 (LLMs) | 56.7 | 66.7 | Qwen2.5-32B, 1K math traces |
| MATH 500 (LLMs) | 94.4 | 94.8 | |
| GPQA (LLMs) | 60.6 | 64.1 | |
| D4RL HalfCheetah-Med (RL) | 39.3 | 40.9 | SFT(Q) vs. iw-SFT(Q), competitive with AWAC (40.5) |
| Franka Kitchen fine-tuning | 58.2 | 61.8 | SFT(Q) vs. iw-SFT(Q), top 5% fine-tuning |
- Sequence-level weighting yields stable convergence without test-time "budget forcing".
- On partially observed continuous control tasks (D4RL, Franka Kitchen), iw-SFT(Q) meets or surpasses advanced RL baselines such as AWAC and CQL.
- Minor changes in implementation are sufficient for consistent gains (Qin et al., 17 Jul 2025).