Proximal SFT (PSFT)
- Proximal SFT (PSFT) is a supervised training paradigm that incorporates trust-region constraints to restrict per-token probability updates and prevent policy drift.
- It adapts PPO’s surrogate objective to fine-tuning by clipping probability ratios, reducing gradient variance and avoiding entropy collapse.
- Empirical results show PSFT improves out-of-domain performance and alignment compared to standard SFT, making it effective for robust model specialization.
Proximal Supervised Fine-Tuning (PSFT) is a supervised training paradigm for large foundation models that introduces trust-region constraints into the standard fine-tuning process to address generalization decay and capability preservation. Rather than minimizing cross-entropy loss alone, PSFT applies a proximal objective—drawing on principles from Trust-Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) in reinforcement learning—to restrict per-token probability updates, thereby constraining policy drift and mitigating entropy collapse.
1. Theoretical Framework and Loss Derivation
PSFT is formulated by reinterpreting supervised fine-tuning (SFT) as a degenerate policy-gradient update. In typical RL, the policy-gradient loss is expressed as:

$$\mathcal{L}_{\text{PG}}(\theta) = -\,\mathbb{E}\left[\sum_t A_t \log \pi_\theta(a_t \mid s_t)\right],$$

where $A_t$ is the advantage. SFT minimizes cross-entropy:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\right].$$

Recasting SFT as a policy-gradient update for an offline dataset $\mathcal{D}$, every supervised token is treated as optimal (i.e., $A_t = 1$), producing a single-step gradient that is unconstrained in the policy space. This lack of a trust region can result in overfitting and degeneration of pre-existing model capabilities.
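The degeneracy can be made explicit in one line: substituting a unit advantage for every supervised token collapses the policy-gradient loss to cross-entropy (notation: $\pi_\theta$ is the model, $A_t$ the per-token advantage):

$$-\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_t A_t \log \pi_\theta(y_t \mid x, y_{<t})\right]\Bigg|_{A_t \equiv 1} = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\right] = \mathcal{L}_{\text{SFT}}(\theta).$$

Since nothing in this objective references a previous policy, each gradient step is free to move the token distribution arbitrarily far—exactly the freedom that PSFT's trust region reins in.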
TRPO introduces a hard KL-divergence constraint, and PPO replaces this with a clipped surrogate objective. PSFT adapts PPO's surrogate to the supervised setting, yielding a new loss:

$$\mathcal{L}_{\text{PSFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_t \min\Big(r_t(\theta),\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\Big)\right],$$

with

$$r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})},$$

where $\theta_{\text{old}}$ is a reference parameter snapshot, updated after each batch or epoch. The gradient of the PSFT objective further masks tokens exceeding the clipping bounds, promoting conservative update steps.
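As a concrete illustration, the clipped surrogate can be written per token in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: `logp_new` and `logp_old` stand for the supervised token's log-probability under the current and reference policies.

```python
import math

def psft_token_loss(logp_new, logp_old, eps=0.2):
    """Per-token PSFT surrogate with the advantage fixed at 1, as in SFT.

    A minimal sketch; logp_new/logp_old are the log-probabilities of the
    supervised token under the current and reference (snapshot) policies.
    """
    r = math.exp(logp_new - logp_old)              # importance ratio r_t
    clipped = max(1.0 - eps, min(r, 1.0 + eps))    # clip(r_t, 1-eps, 1+eps)
    # Negate because training minimizes a loss while PSFT maximizes
    # the clipped surrogate min(r_t, clip(r_t, 1-eps, 1+eps)).
    return -min(r, clipped)
```

With identical policies the ratio is 1 and the loss is exactly -1; once the ratio exceeds 1 + eps the loss plateaus at -(1 + eps), so pushing that token's probability further earns no additional credit.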
2. Training Objective and Algorithmic Structure
PSFT’s main objective maximizes the clipped surrogate over each pair $(x, y)$ drawn from the training data:

$$J_{\text{PSFT}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_t \min\Big(r_t(\theta),\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\Big)\right].$$
Training implements the following steps:
- For each mini-batch, calculate $\log \pi_\theta(y_t \mid x, y_{<t})$ and $\log \pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})$ for every supervised token.
- Compute importance ratios $r_t(\theta)$ and apply clipping.
- Accumulate the negative (maximization) of the surrogate into the loss.
- Update $\theta$ through gradient descent.
- Set $\theta_{\text{old}} \leftarrow \theta$ for the next batch.
No model architecture changes are required; PSFT is implemented strictly as a modification to the loss function and update protocol.
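The steps above can be sketched end to end on a toy softmax "policy". Everything here is illustrative rather than the paper's setup: the single logit vector `theta` stands in for a language model, and the finite-difference gradient stands in for autograd.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def seq_logp(theta, tokens):
    # Log-probability of each observed token under the toy policy.
    p = softmax(theta)
    return [math.log(p[t]) for t in tokens]

def psft_loss(theta, theta_old, tokens, eps=0.2):
    # Mean negative clipped surrogate over the tokens (advantage = 1).
    loss = 0.0
    for lp, lp_old in zip(seq_logp(theta, tokens), seq_logp(theta_old, tokens)):
        r = math.exp(lp - lp_old)                       # importance ratio
        loss -= min(r, max(1.0 - eps, min(r, 1.0 + eps)))
    return loss / len(tokens)

def train(tokens, vocab=3, steps=50, lr=0.5, eps=0.2, h=1e-5):
    theta = [0.0] * vocab
    for _ in range(steps):
        theta_old = list(theta)                  # snapshot the reference policy
        base = psft_loss(theta, theta_old, tokens, eps)
        grad = []
        for i in range(vocab):                   # finite-difference gradient
            bumped = list(theta)
            bumped[i] += h                       # (a real impl uses autograd)
            grad.append((psft_loss(bumped, theta_old, tokens, eps) - base) / h)
        theta = [t - lr * g for t, g in zip(theta, grad)]  # gradient descent
        # theta_old is refreshed at the top of the next iteration
    return theta
```

Running `train([0, 0, 1])` raises the probability of the observed tokens in proportion to their frequency, with each step's ratio measured against the fresh snapshot so the clip bound limits any single update.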
3. Hyperparameterization and Experimental Procedure
Critical hyperparameters for PSFT include:
- Train batch size: 256
- PPO mini-batch size: 32
- Learning rate: (weight decay: 0.1)
- Epochs: 10 (typical values: 5–10)
- Clipping parameter ε: 0.2–0.3
- Sequence length cutoff: 6K–10K tokens
These choices were held constant across mathematical reasoning settings (e.g., Qwen2.5-7B, Llama3.1-8B), human-value alignment (e.g., Qwen3-4B), and RL cold-start scenarios. Updating the reference policy and caching per-batch logits are the only training-loop modifications relative to SFT.
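Collected into a configuration fragment, the reported settings look as follows. The key names are illustrative, not from any particular framework; the learning rate is omitted in the source and therefore left out here.

```python
# Hypothetical config keys; values follow the hyperparameters reported above.
psft_config = {
    "train_batch_size": 256,
    "ppo_mini_batch_size": 32,
    "weight_decay": 0.1,
    "epochs": 10,           # typical range: 5-10
    "clip_eps": 0.2,        # reported range: 0.2-0.3
    "max_seq_len": 8192,    # sequence cutoff between 6K and 10K tokens
}
```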
4. Empirical Results
PSFT performance was systematically evaluated on both mathematical reasoning and alignment tasks:
Mathematical Reasoning
Using the OpenR1-Math-8192 dataset and models such as Qwen2.5-7B, PSFT yields:
| Method | In-domain avg | Out-of-domain avg |
|---|---|---|
| Original | 37.98 | 59.85 |
| SFT | 47.99 | 57.90 |
| SFTₖₗ | 47.08 | 57.38 |
| PSFT | 46.98 | 61.26 |
| PSFT₍warm₎ | 48.17 | 58.53 |
PSFT roughly matches SFT in-domain (46.98 vs. 47.99) and achieves a significant improvement in out-of-domain generalization (+3.4 points over SFT for vanilla PSFT). When used as the initialization for generative RL post-training (e.g., GRPO), PSFT further amplifies these gains:
| Initialization | Post-RL in-domain | Post-RL out-of-domain |
|---|---|---|
| SFT → GRPO | 52.40 | 59.90 |
| PSFT → GRPO | 53.31 | 64.06 |
Human-Value Alignment
In UltraFeedback → DPO alignment, PSFT reduces generalization loss (“alignment tax”) and improves preference optimization:
| Method | AlpacaEval LC/WR | Arena-Hard WR | MT-Bench 1-turn/2-turn |
|---|---|---|---|
| SFT → DPO | 16.96 / 13.40 | 26.50 | 7.91 / 6.00 |
| PSFT₍prolong₎ → DPO | 19.26 / 15.17 | 30.20 | 7.63 / 6.74 |
| PSFT → DPO | 23.29 / 20.13 | 36.40 | 8.51 / 6.95 |
This demonstrates the preservation of reasoning abilities and stronger alignment outcomes with PSFT-based protocols.
5. Stability and Generalization Analysis
PSFT’s clipped surrogate objective maintains a soft trust region by bounding per-token probability ratios to $[1-\epsilon,\ 1+\epsilon]$, preventing abrupt policy drift and entropy collapse. Empirical entropy traces remain smooth, with higher entropy than SFT throughout prolonged training. Gradient variance is markedly reduced, avoiding the performance swings typical of unconstrained fine-tuning.
A plausible implication is that PSFT’s stable update step prevents overfitting, aiding generalization on out-of-domain tasks and preparing models for subsequent reinforcement learning or preference optimization stages.
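One detail of the soft trust region is worth making concrete: because the advantage is fixed at 1, the `min` in the surrogate only flattens the objective above the upper bound, so only tokens whose ratio overshoots 1 + ε lose their gradient, while low-ratio tokens still receive a push toward the trust region. A hypothetical helper (not from the paper) showing which tokens get masked:

```python
def masked_tokens(ratios, eps=0.2):
    # With advantage = 1, min(r, clip(r, 1-eps, 1+eps)) equals r whenever
    # r <= 1+eps, so those tokens keep their gradient; above 1+eps the
    # surrogate is the constant 1+eps and the token's gradient is zeroed.
    return [r > 1.0 + eps for r in ratios]
```

For example, ratios of 0.5 and 1.0 remain active while a ratio of 1.3 is masked at the default ε = 0.2.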
6. Limitations and Future Directions
The heuristic selection of the clipping parameter presents an open research question; adaptive trust-region mechanisms could further optimize PSFT. Current formulations treat all offline examples as equivalent (advantage = 1); weighted sampling or per-example advantage estimation may extend the protocol’s flexibility. Integration with parameter-efficient fine-tuning (such as LoRA or adapters) and comparative analysis with other conservative SFT algorithms (iW-SFT, DFT) remain unexplored. Large-scale applications (100B+ parameters) and extension to non-text modalities represent further avenues for research.
7. Contextual Significance
PSFT bridges supervised learning and RL trust-region methods, contributing a principled approach to maintaining foundation model robustness during task specialization. Its capacity to reduce alignment tax and enhance cold-start RL underlines its potential as a stable pre-optimization layer in contemporary LM workflows. The approach sets precedent for incorporating conservative update techniques within offline supervised regimes, making PSFT both an empirical standard and a methodological reference point for ongoing studies in fine-tuning and robustness (Zhu et al., 25 Aug 2025).