
Proximal SFT (PSFT)

Updated 19 January 2026
  • Proximal SFT (PSFT) is a supervised training paradigm that incorporates trust-region constraints to restrict per-token probability updates and prevent policy drift.
  • It adapts PPO’s surrogate objective to fine-tuning by clipping probability ratios, reducing gradient variance and avoiding entropy collapse.
  • Empirical results show PSFT improves out-of-domain performance and alignment compared to standard SFT, making it effective for robust model specialization.

Proximal Supervised Fine-Tuning (PSFT) is a supervised training paradigm for large foundation models that introduces trust-region constraints into the standard fine-tuning process to address generalization decay and capability preservation. Rather than minimizing cross-entropy loss alone, PSFT applies a proximal objective—drawing on principles from Trust-Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) in reinforcement learning—to restrict per-token probability updates, thereby constraining policy drift and mitigating entropy collapse.

1. Theoretical Framework and Loss Derivation

PSFT is formulated by reinterpreting supervised fine-tuning (SFT) as a degenerate policy-gradient update. In typical RL, the policy-gradient loss is expressed as:

$$L^{\mathrm{PG}}(\theta) = -\mathbb{E}_{(s_t, a_t) \sim \pi_\theta} \left[ \log \pi_\theta(a_t \mid s_t) \, \widehat{A}_t \right]$$

where $\widehat{A}_t$ is the advantage estimate. SFT minimizes cross-entropy:

$$L^{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(s_t, a^*_t) \sim \mathcal{D}} \left[ \log \pi_\theta(a^*_t \mid s_t) \right]$$

Recasting SFT as a policy-gradient update over an offline dataset $\mathcal{D}$, every supervised token is treated as optimal (i.e., $\widehat{A}_t = 1$), producing a single-step gradient that is unconstrained in policy space. This lack of a trust region can result in overfitting and degeneration of pre-existing model capabilities.

TRPO introduces a hard KL-divergence constraint, and PPO replaces this with a clipped surrogate objective. PSFT adapts PPO’s surrogate to the supervised setting, yielding a new loss:

$$L^{\mathrm{PSFT}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \min\left( r_t(\theta), \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \right) \right]$$

with

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

where $\theta_{\text{old}}$ is a reference parameter snapshot, updated after each batch or epoch. The gradient of the PSFT objective further masks tokens whose ratios exceed the clipping bounds, promoting conservative update steps.
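The clipped surrogate can be sketched for a single token in plain Python; the function name and scalar log-probability inputs are illustrative assumptions, not the authors' implementation:

```python
import math

def psft_token_objective(logp_new: float, logp_old: float, eps: float = 0.28) -> float:
    """Clipped surrogate min(r, clip(r, 1-eps, 1+eps)) for one supervised token.

    logp_new / logp_old: log pi_theta(a|s) under the current policy and the
    reference snapshot theta_old, respectively.
    """
    r = math.exp(logp_new - logp_old)              # importance ratio r_t(theta)
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)  # clip(r, 1-eps, 1+eps)
    return min(r, r_clipped)                       # surrogate (to be maximized)
```

Inside the trust region the surrogate equals the raw ratio, so the gradient matches ordinary likelihood ascent; once the ratio rises past $1+\epsilon$, the clip caps the objective at a constant, zeroing that token's gradient.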

2. Training Objective and Algorithmic Structure

PSFT’s main objective maximizes the clipped surrogate over each $(s, a)$ pair drawn from the training data:

$$\max_{\theta}\; L^{\mathrm{PSFT}}(\theta) = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \min\left( r(\theta), \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon) \right) \right]$$

Training implements the following steps:

  1. For each mini-batch, compute $\pi_\theta(a|s)$ and $\pi_{\theta_{\text{old}}}(a|s)$.
  2. Compute importance ratios $r$ and apply clipping.
  3. Accumulate the negated surrogate into the loss, so that minimization maximizes the clipped objective.
  4. Update $\theta$ via gradient descent.
  5. Set $\theta_{\text{old}} \gets \theta$ for the next batch.

No model architecture changes are required; PSFT is implemented strictly as a modification to the loss function and update protocol.
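The five steps above can be sketched end-to-end on a toy categorical policy. The softmax parameterization, tiny dataset of target actions, and learning rate are illustrative assumptions; several inner updates are taken against one reference snapshot so that the clipping mask actually engages:

```python
import math

def log_softmax(theta):
    """Numerically stable log-probabilities of a categorical policy."""
    m = max(theta)
    lse = m + math.log(sum(math.exp(t - m) for t in theta))
    return [t - lse for t in theta]

def psft_pass(theta, actions, eps=0.28, lr=0.5, inner_steps=4):
    """One PSFT pass: snapshot the reference policy, then take several
    clipped ascent steps against it (steps 1-5 from the text)."""
    logp_old = log_softmax(theta)               # reference snapshot theta_old
    for _ in range(inner_steps):                # mini-batch updates per snapshot
        logp_new = log_softmax(theta)           # step 1: current-policy log-probs
        probs = [math.exp(l) for l in logp_new]
        grad = [0.0] * len(theta)
        for a in actions:
            r = math.exp(logp_new[a] - logp_old[a])  # step 2: importance ratio
            if 1 - eps <= r <= 1 + eps:              # tokens past the bound are masked
                # d r / d theta_k = r * (1[k == a] - p_k) for a softmax policy
                for k in range(len(theta)):
                    grad[k] += r * ((1.0 if k == a else 0.0) - probs[k])
        # steps 3-4: ascend the surrogate (equivalently, descend its negation)
        theta = [t + lr * g / len(actions) for t, g in zip(theta, grad)]
    return theta                                # step 5: caller snapshots theta_old

theta = psft_pass([0.0, 0.0, 0.0], actions=[0, 0, 1])
```

After the pass, the probability of the majority target action has increased, but tokens whose ratio drifts past $1+\epsilon$ temporarily stop contributing gradient, which is the conservative behavior the trust region is meant to enforce.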

3. Hyperparameterization and Experimental Procedure

Critical hyperparameters for PSFT include:

  • Train batch size: 256
  • PPO mini-batch size: 32
  • Learning rate: $1 \times 10^{-6}$ (weight decay: 0.1)
  • Epochs: 10 (typical values: 5–10)
  • Clipping parameter: $\epsilon = 0.28$ (range: 0.2–0.3)
  • Sequence length cutoff: 6K–10K tokens

These choices were held constant across mathematical reasoning settings (e.g., Qwen2.5-7B, Llama3.1-8B), human-value alignment (e.g., Qwen3-4B), and RL cold-start scenarios. Updating the reference policy and caching per-batch logits are the only modifications to the SFT training loop.
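For concreteness, these settings can be collected in a hypothetical config dict; the key names are illustrative rather than tied to any particular training framework, and the 8192-token cutoff is one point within the reported 6K–10K range:

```python
# Illustrative PSFT hyperparameter config (key names are assumptions,
# not from any specific framework); values mirror those reported above.
psft_config = {
    "train_batch_size": 256,
    "ppo_mini_batch_size": 32,
    "learning_rate": 1e-6,
    "weight_decay": 0.1,
    "epochs": 10,            # typical values: 5-10
    "clip_epsilon": 0.28,    # typical range: 0.2-0.3
    "max_seq_len": 8192,     # reported cutoff range: 6K-10K tokens
}
```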

4. Empirical Results

PSFT performance was systematically evaluated on both mathematical reasoning and alignment tasks:

Mathematical Reasoning

Using the OpenR1-Math-8192 dataset and models such as Qwen2.5-7B, PSFT yields:

| Method | In-domain avg | Out-of-domain avg |
|---|---|---|
| Original | 37.98 | 59.85 |
| SFT | 47.99 | 57.90 |
| SFTₖₗ | 47.08 | 57.38 |
| PSFT | 46.98 | 61.26 |
| PSFT₍warm₎ | 48.17 | 58.53 |

PSFT matches SFT in-domain and achieves a significant improvement in out-of-domain generalization (+3.4 points for vanilla PSFT). When used as the initialization for generative RL post-training (e.g., GRPO), PSFT further amplifies these gains:

| Initialization | Post-RL in-domain | Post-RL out-of-domain |
|---|---|---|
| SFT → GRPO | 52.40 | 59.90 |
| PSFT → GRPO | 53.31 | 64.06 |

Human-Value Alignment

In UltraFeedback → DPO alignment, PSFT reduces generalization loss (“alignment tax”) and improves preference optimization:

| Method | AlpacaEval LC / WR | Arena-Hard WR | MT-Bench 1-turn / 2-turn |
|---|---|---|---|
| SFT → DPO | 16.96 / 13.40 | 26.50 | 7.91 / 6.00 |
| PSFTₚᵣₒₗₒₙg → DPO | 19.26 / 15.17 | 30.20 | 7.63 / 6.74 |
| PSFT → DPO | 23.29 / 20.13 | 36.40 | 8.51 / 6.95 |

This demonstrates the preservation of reasoning abilities and stronger alignment outcomes with PSFT-based protocols.

5. Stability and Generalization Analysis

PSFT’s clipped surrogate objective maintains a soft trust region by bounding per-token probability ratios to $[1-\epsilon, 1+\epsilon]$, preventing abrupt policy drift and entropy collapse. Empirical entropy traces remain smooth, with higher entropy than SFT throughout prolonged training. Gradient variance is markedly reduced, avoiding the performance swings typical of unconstrained fine-tuning.

A plausible implication is that PSFT’s stable update step prevents overfitting, aiding generalization on out-of-domain tasks and preparing models for subsequent reinforcement learning or preference optimization stages.

6. Limitations and Future Directions

The heuristic selection of the clipping parameter $\epsilon$ presents an open research question; adaptive trust-region mechanisms could further optimize PSFT. Current formulations treat all offline examples as equivalent (advantage = 1); weighted sampling or per-example advantage estimation may extend the protocol’s flexibility. Integration with parameter-efficient fine-tuning (such as LoRA or adapters) and comparative analysis with other conservative SFT algorithms (iW-SFT, DFT) remain unexplored. Large-scale applications (100B+ parameters) and extension to non-text modalities represent further avenues for research.

7. Contextual Significance

PSFT bridges supervised learning and RL trust-region methods, contributing a principled approach to maintaining foundation model robustness during task specialization. Its capacity to reduce alignment tax and enhance cold-start RL underlines its potential as a stable pre-optimization layer in contemporary LM workflows. The approach sets precedent for incorporating conservative update techniques within offline supervised regimes, making PSFT both an empirical standard and a methodological reference point for ongoing studies in fine-tuning and robustness (Zhu et al., 25 Aug 2025).
