Self-Rewarding PPO for LLM Alignment
- The paper introduces a self-rewarding mechanism that replaces external feedback with a coherent log-policy ratio reward derived from the difference between SFT and pretrained models.
- The algorithm follows a three-stage process—supervised fine-tuning, EOS-based coherent reward computation, and on-policy PPO updates—to enhance data efficiency and generalization.
- Empirical results demonstrate that SRPPO improves performance by 2–4 points on benchmarks and maintains robust alignment in data-scarce, out-of-domain settings.
Self-Rewarding Proximal Policy Optimization (SRPPO) denotes a class of reinforcement learning fine-tuning strategies for LLMs in which the reward function is generated directly from the model’s own refinement trajectory, rather than from external preference annotation or independently trained reward models. The prototypical SRPPO algorithm, as described in "Self-Rewarding PPO: Aligning LLMs with Demonstrations Only," leverages a mathematically coherent reward based on the log policy ratio between an SFT (supervised fine-tuned) model and its pretrained base, enabling on-policy optimization from demonstration data alone and enhancing generalization, data efficiency, and robustness without reliance on human-labeled preferences (Zhang et al., 24 Oct 2025). By embedding the reward into the PPO objective as the alignment direction defined by SFT, SRPPO achieves superior alignment of LLMs in out-of-domain and data-scarce settings.
1. Mathematical Framework
The foundation of Self-Rewarding PPO is the coherent reward

$$ r(x, y) = \beta \log \frac{\pi_{\text{SFT}}(y \mid x)}{\pi_{\text{pre}}(y \mid x)}, $$

where $\pi_{\text{pre}}$ is the pretrained policy, $\pi_{\text{SFT}}$ is the SFT-refined policy, and $\beta$ is a scaling coefficient. For a sequence $y$ given prompt $x$, the reward is assigned only at the EOS token of each rollout, avoiding length-degeneration pathologies.
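The coherent reward reduces to a masked log-probability difference between the two frozen policies, summed over response tokens and emitted as a single scalar at EOS. A minimal sketch (the helper name and tensor layout are illustrative, not from the paper):

```python
import torch

def coherent_reward(sft_logp, pre_logp, resp_mask, beta=1.0):
    """Coherent reward r(x, y) = beta * sum_t [log pi_SFT - log pi_pre] over
    response tokens, returned as one scalar per rollout (assigned at EOS).
    sft_logp / pre_logp: per-token log-probs of the sampled tokens under each
    policy, shape (B, T). resp_mask: 1.0 on response tokens, 0.0 on the prompt."""
    return beta * ((sft_logp - pre_logp) * resp_mask).sum(dim=-1)
```

In practice the per-token log-probs come from a forward pass of the frozen SFT and pretrained models over the sampled rollout; only the response span (after the prompt) contributes.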
SRPPO substitutes this reward for the external objective in the PPO actor loss

$$ \mathcal{L}_{\text{actor}}(\theta) = -\,\mathbb{E}_t\!\left[ \min\!\big( \rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right], $$

with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage $\hat{A}_t$ computed by GAE with parameters $(\gamma, \lambda)$. The critic is optimized with a standard MSE loss against the cumulative self-derived reward. This formulation retains the operational simplicity and monotonic-improvement properties of PPO.
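The clipped surrogate and GAE are standard PPO components; a compact, self-contained sketch of both (function names are illustrative):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (1-D tensors,
    no bootstrap past the final step, as in episodic LLM generation)."""
    T = rewards.size(0)
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual
        last = delta + gamma * lam * last                  # exponentially weighted sum
        adv[t] = last
    return adv

def ppo_actor_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate: -E[min(rho * A, clip(rho, 1-eps, 1+eps) * A)]."""
    rho = torch.exp(logp_new - logp_old)
    unclipped = rho * adv
    clipped = torch.clamp(rho, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

The only SRPPO-specific change relative to vanilla PPO is the source of `rewards`: the coherent log-ratio scalar placed at the EOS step rather than an external reward-model score.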
2. Algorithmic Implementation
SRPPO proceeds in three stages:
- Supervised Fine-Tuning (SFT): The base model is fine-tuned with cross-entropy on a demonstration set $\mathcal{D}$ for 1–2 epochs, yielding the reference policy $\pi_{\text{SFT}}$.
- Coherent Reward Computation: For any sample $(x, y)$, compute $r(x, y)$ from the log-policy ratio $\log \pi_{\text{SFT}}(y \mid x) - \log \pi_{\text{pre}}(y \mid x)$. This becomes the only reward signal for RL.
- On-Policy PPO Fine-Tuning: Initialize the PPO actor and critic from SFT. At each iteration, sample prompts from a diverse, unlabeled pool $\mathcal{P}$, collect responses, assign $r(x, y)$ at the EOS token, compute advantages, and update both actor and critic following standard PPO update steps.
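The on-policy stage can be sketched as a loop over hypothetical callables (none of these names come from the paper; they stand in for the rollout, reward, and update machinery of any PPO implementation):

```python
def srppo_iteration(sample_prompts, rollout, reward_at_eos, ppo_update):
    """One on-policy SRPPO iteration: sample unlabeled prompts, collect
    rollouts with the current actor, assign the coherent reward at EOS,
    then run a PPO actor/critic update on the batch."""
    prompts = sample_prompts()               # draw from the unlabeled pool
    batch = []
    for x in prompts:
        y = rollout(x)                       # generate a response with the actor
        r = reward_at_eos(x, y)              # coherent log-ratio reward, EOS only
        batch.append((x, y, r))
    return ppo_update(batch)                 # actor + critic update; returns stats
```

Because the reward callable only needs forward passes of two frozen models, no reward-model training or preference collection step appears anywhere in the loop.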
Recommended hyperparameters include an SFT batch size of 128 and 2 SFT epochs, a small PPO actor learning rate with a somewhat larger critic learning rate, standard PPO clipping, GAE advantage estimation, and optional KL regularization. Critic warmup buffers and EOS-only reward assignment are important for stability and length control.
| Stage | Main Action | Key Setting |
|---|---|---|
| SFT | Cross-entropy on demonstration set $\mathcal{D}$ | 1–2 epochs |
| Reward | Coherent log-ratio $r(x, y)$ | Assigned at EOS only |
| PPO | On-policy update over prompt pool $\mathcal{P}$ | Clipped surrogate, low actor LR |
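For concreteness, the stages above can be collected into a single configuration sketch. Only the batch size and SFT epoch count are stated in this article; every other value is a common PPO default, not a confirmed setting from the paper:

```python
# Hypothetical SRPPO configuration. Only sft.batch_size and sft.epochs are
# reported values; the PPO entries are widely used defaults, not the paper's.
srppo_config = {
    "sft": {"batch_size": 128, "epochs": 2},
    "ppo": {
        "clip_eps": 0.2,         # assumed standard PPO clipping
        "gae_lambda": 0.95,      # assumed common GAE setting
        "gae_gamma": 1.0,        # episodic LLM rollouts often use gamma = 1
        "kl_coef": 0.0,          # optional KL regularization, off by default
        "reward_position": "eos_only",
    },
}
```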
3. Theoretical Properties
The log-ratio reward provides a sensitive, task-adaptive feedback that reflects how much the SFT policy improves over pretraining on a per-sample basis. Because the reward is derived from the same model family under refinement, it rapidly reflects distribution shifts and avoids static overfitting.
SRPPO does not supply a formal convergence proof specific to the self-reward setup, but inherits the monotonic improvement guarantees of PPO under clipped surrogates as described by Schulman et al. (2017). The method also connects to the coherent soft imitation learning framework [Watson et al., 2024], which shows theoretical convergence to the expert under adequate policy capacity and data coverage.
A plausible implication is that, unlike behavior cloning, the on-policy self-alignment trajectory incentivizes continual adaptation and exploration, especially beneficial for out-of-distribution generalization.
4. Empirical Performance
The SRPPO framework is benchmarked using TULU-v2-mix high-quality demonstrations for SFT and UltraFeedback prompts for PPO exploration. Evaluation tasks include IFEval (instruction following), GSM8k (mathematical reasoning), GPQA (graduate-level QA, via CoT and direct), and AlpacaEval (conversational win-rate).
In the minimum-overlap setting on Mistral-7B and LLAMA3-8B, SRPPO consistently outperforms both SFT (even extended SFT) and prior policy-gradient methods such as SPIN and PPO with external RMs. Notably, SRPPO yields average score gains of 2–4 points on major metrics, especially on generalization- and data-efficiency-sensitive benchmarks.
| Method | IFEval L/S | GSM8k EM | GPQA CoT/NonCoT | AlpacaEval | Average |
|---|---|---|---|---|---|
| Pretrain | 30.6/29.4 | 37.3 | 12.5/27.2 | 0.07/0.12 | 21.8 |
| SFT (2ep) | 42.5/40.5 | 46.5 | 23.9/26.3 | 8.95/4.60 | 30.0 |
| SPIN | 45.1/38.7 | 43.0 | 19.9/26.6 | 5.81/4.29 | 28.3 |
| SRPPO | 47.6/41.4 | 46.9 | 24.3/26.6 | 12.5/13.2 | 32.4 |
Generalization is especially pronounced in out-of-domain tasks (mathematics, complex instructions), with the model maintaining robustness to domain shift even when further SFT is performed on unrelated data. SRPPO’s reward, derived from model improvement rather than external annotation, leads to consistently larger performance gains than PPO using external preference RMs.
5. Practical Implementation Guidance
Short SFT (1–2 epochs) is recommended, serving primarily to set the alignment direction for the RL phase and mitigating SFT overfitting. The reward should be applied at the sequence (EOS) level to prevent unwarranted length incentives, and a large unlabeled prompt set is critical for effective on-policy exploration. Very low actor learning rates and standard PPO buffer sizes are effective, with optional KL regularization stabilizing the fine-tuning trajectory.
Implementation in standard open-source PPO libraries (e.g., OpenRLHF) involves substituting the external reward term with the coherent log-ratio reward $r(x, y)$ and adjusting reward assignment to the EOS token. Monitoring generation length and maintaining critic stability via buffer warmup are best practices.
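The EOS-only assignment mirrors how external reward-model scores are typically placed in PPO pipelines: a single scalar at the final token, zeros elsewhere. A minimal sketch of that placement (a generic helper, not any library's actual API):

```python
import torch

def place_reward_at_eos(seq_reward, seq_len, eos_pos):
    """Build the per-token reward vector for one rollout: the scalar coherent
    reward at the EOS position, zero at every other step. This is the tensor
    a PPO trainer consumes in place of per-token reward-model scores."""
    per_token = torch.zeros(seq_len)
    per_token[eos_pos] = seq_reward
    return per_token
```

Keeping all reward mass at EOS removes any per-token incentive to lengthen generations, which is why monitoring generation length remains a useful sanity check rather than a required control.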
6. Relation to Broader Self-Rewarding PPO Variants
Beyond the LLM-focused SRPPO, self-rewarding mechanisms appear in other adaptations of the PPO framework, such as PPO-BR (Rahman, 23 May 2025). PPO-BR implements bidirectional regularization of the PPO trust region via dual policy-derived signals: entropy-driven expansion (enhancing exploration when uncertainty is high) and reward-guided contraction (improving stability when reward improvement slows). The adaptive clipping parameter is governed by both exploration and convergence cues, enabling a "self-rewarding" trust region. PPO-BR achieves substantial convergence acceleration and variance reduction on control benchmarks and is positioned as broadly applicable—across both RL environments and LLM alignment—due to its lightweight implementation and theoretical guarantees. However, unlike SRPPO, PPO-BR’s self-rewarding refers to the adaptive learning dynamics, not the explicit reward for actions.
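PPO-BR's bidirectional trust region can be illustrated with a small adaptive-clipping rule. The update direction matches the description above (entropy widens the clip range, stalled reward shrinks it), but the functional form and all constants here are illustrative, not the paper's:

```python
def adaptive_clip_eps(base_eps, entropy, entropy_ref, reward_delta,
                      k_explore=0.5, shrink=0.5, stall_threshold=0.01,
                      eps_min=0.05, eps_max=0.4):
    """PPO-BR-style bidirectional trust region (sketch, illustrative constants):
    entropy above a running reference widens the clip range (exploration);
    reward improvement below a stall threshold shrinks it (stability)."""
    # Entropy-driven expansion: more uncertainty -> wider trust region.
    eps = base_eps * (1.0 + k_explore * max(0.0, entropy - entropy_ref))
    # Reward-guided contraction: stalled improvement -> tighter trust region.
    if reward_delta < stall_threshold:
        eps *= shrink
    return min(max(eps, eps_min), eps_max)
```

The resulting `eps` would replace the fixed clipping constant in the standard PPO surrogate, coupling the trust-region width to the learning dynamics rather than to the reward signal itself.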
7. Comparison with Related Methods
SRPPO is distinct from (and complementary to) prior approaches such as SPIN (self-play DPO) and PPO with external learned reward models. While those methods either utilize groupwise ranking or external preference annotation, SRPPO’s coherent reward enables purely demonstration-based, annotation-free alignment. Compared to GRPO, which operates on group relative policy optimization and does not incorporate entropy or phase-aware adaptation, SRPPO and PPO-BR introduce self-derived feedback loops that foster sharper exploration-exploitation balance. Theoretical and empirical analysis suggests that defining feedback relative to model improvement, rather than external signal, facilitates more robust and generalizable policy refinement in both language and control domains (Zhang et al., 24 Oct 2025, Rahman, 23 May 2025).