
Self-Rewarding PPO for LLM Alignment

Updated 19 January 2026
  • The paper introduces a self-rewarding mechanism that replaces external feedback with a coherent log-policy ratio reward derived from the difference between SFT and pretrained models.
  • The algorithm follows a three-stage process—supervised fine-tuning, EOS-based coherent reward computation, and on-policy PPO updates—to enhance data efficiency and generalization.
  • Empirical results demonstrate that SRPPO improves performance by 2–4 points on benchmarks and maintains robust alignment in data-scarce, out-of-domain settings.

Self-Rewarding Proximal Policy Optimization (SRPPO) denotes a class of reinforcement learning fine-tuning strategies for LLMs in which the reward function is generated directly from the model’s own refinement trajectory, rather than from external preference annotation or independently trained reward models. The prototypical SRPPO algorithm, as described in "Self-Rewarding PPO: Aligning LLMs with Demonstrations Only," leverages a mathematically coherent reward based on the log policy ratio between an SFT (supervised fine-tuned) model and its pretrained base, enabling on-policy optimization from demonstration data alone and enhancing generalization, data efficiency, and robustness without reliance on human-labeled preferences (Zhang et al., 24 Oct 2025). By embedding the reward into the PPO objective as the alignment direction defined by SFT, SRPPO achieves superior alignment of LLMs in out-of-domain and data-scarce settings.

1. Mathematical Framework

The foundation of Self-Rewarding PPO is the coherent reward:

$$r(x, y) = \log \pi_{\text{sft}}(y \mid x) - \log \pi_{\text{pt}}(y \mid x) = \sum_{j=1}^{m} \left[\log \pi_{\text{sft}}(y_j \mid x, y_{<j}) - \log \pi_{\text{pt}}(y_j \mid x, y_{<j})\right]$$

where $\pi_{\text{pt}}$ is the pretrained policy and $\pi_{\text{sft}}$ is the SFT-refined policy. For a sequence $y = (y_1, \ldots, y_m)$ given prompt $x$, the reward is assigned only at the EOS token of each rollout, avoiding length-degeneration pathologies.
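As a minimal sketch, the coherent reward reduces to a sum of per-token log-probability differences. The function below assumes the per-token log-probabilities of the same sampled sequence under the SFT and pretrained policies have already been obtained (e.g. via one forward pass of each model); the names and toy values are illustrative, not from the paper.

```python
# Sketch of the coherent self-reward. Inputs are per-token log-probabilities
# of the same response y under the SFT and pretrained policies (assumed
# precomputed); the result is r(x, y) = log pi_sft(y|x) - log pi_pt(y|x).

def coherent_reward(sft_token_logprobs, pt_token_logprobs):
    """Sum of token-level log-ratio contributions over the response."""
    assert len(sft_token_logprobs) == len(pt_token_logprobs)
    return sum(s - p for s, p in zip(sft_token_logprobs, pt_token_logprobs))

# Toy example: a 3-token response the SFT model prefers over the base model,
# so the reward is positive.
r = coherent_reward([-1.0, -0.5, -0.2], [-1.5, -1.0, -0.4])
```

A positive reward indicates the SFT policy assigns the response higher likelihood than the pretrained base, i.e. the response lies along the alignment direction defined by SFT.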

SRPPO substitutes this reward for the external objective in the PPO actor loss:

$$L_{\text{actor}}(\theta) = -\,\mathbb{E}_t \left[ \min \left( \rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]$$

with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and $\hat{A}_t$ computed by GAE with parameters $\gamma, \lambda$. The critic is optimized with a standard MSE loss against the cumulative self-derived reward. This formulation maintains the operational simplicity and monotonic improvement properties of PPO.
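The clipped surrogate above can be sketched directly; this is the standard PPO actor loss with the self-derived advantages plugged in, shown here on plain Python lists with illustrative toy values.

```python
# Minimal sketch of the clipped PPO actor loss for one batch of timesteps.
# rhos are probability ratios pi_theta / pi_theta_old; advantages come from
# GAE over the self-derived reward.

def ppo_actor_loss(rhos, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    terms = []
    for rho, adv in zip(rhos, advantages):
        clipped = min(max(rho, 1 - eps), 1 + eps)  # clip(rho, 1-eps, 1+eps)
        terms.append(min(rho * adv, clipped * adv))  # pessimistic bound
    return -sum(terms) / len(terms)

# Toy batch: one step with positive advantage, one with negative.
loss = ppo_actor_loss([1.5, 0.7], [1.0, -1.0], eps=0.2)
```

The clipping caps how far a single update can push the policy ratio, which is what underlies PPO's trust-region-like stability.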

2. Algorithmic Implementation

SRPPO proceeds in three stages:

  1. Supervised Fine-Tuning (SFT): The base model $\pi_{\text{pt}}$ is fine-tuned with cross-entropy on a demonstration set $D = \{(x, y)\}$ for 1–2 epochs, yielding the reference policy $\pi_{\text{sft}}$.
  2. Coherent Reward Computation: For any sample $(x, y)$, compute $r(x, y)$ via the log-policy ratio. This becomes the only reward signal for RL.
  3. On-Policy PPO Fine-Tuning: Initialize the PPO actor and critic from the SFT model. At each iteration, sample prompts from a diverse, unlabeled pool $P$, collect responses, assign $r(x, y)$ at the EOS token, compute advantages, and update both actor and critic following standard PPO update steps.
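Stage 3's reward placement and advantage computation can be sketched as follows, under the EOS-only assignment and the GAE settings ($\gamma = 1.0$, $\lambda = 0.95$) reported below; value estimates and sequence lengths here are illustrative placeholders.

```python
# Sketch of EOS-only reward assignment plus GAE advantages for one rollout.
# The scalar self-reward r(x, y) lands on the final (EOS) token; all other
# per-token rewards are zero, discouraging length inflation.

def eos_rewards(seq_len, r):
    """Zero per-token reward everywhere except r at the EOS position."""
    return [0.0] * (seq_len - 1) + [r]

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Standard generalized advantage estimation over one trajectory."""
    advs, gae = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        advs[t] = gae
    return advs

rewards = eos_rewards(3, 1.2)                      # reward only at EOS
advs = gae_advantages(rewards, [0.5, 0.6, 0.9])    # toy critic values
```

Even though the reward is sparse (EOS only), GAE propagates credit backward through the critic's value estimates, giving every token a usable advantage signal.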

Recommended hyperparameters include SFT batch size 128, 2 epochs, PPO actor LR $5\times10^{-8}$, critic LR $9\times10^{-6}$, clipping $\epsilon = 0.2$, GAE $(\lambda = 0.95, \gamma = 1.0)$, and optional KL regularization. Critic warmup buffers and EOS-only reward assignment are important for stability and length control.

| Stage | Main Action | Key Setting |
| --- | --- | --- |
| SFT | Cross-entropy on $D$ | 1–2 epochs |
| Reward | $r(x, y) = \log\pi_{\text{sft}} - \log\pi_{\text{pt}}$ | EOS only |
| PPO | On-policy update over $P$ | $\epsilon = 0.2$, LR $= 5\times10^{-8}$ |

3. Theoretical Properties

The log-ratio reward provides sensitive, task-adaptive feedback that reflects, on a per-sample basis, how much the SFT policy improves over pretraining. Because the reward is derived from the same model family under refinement, it tracks distribution shifts quickly and avoids the static overfitting of a fixed external reward model.

SRPPO does not supply a formal convergence proof specific to the self-reward setup, but inherits the monotonic improvement guarantees of PPO under clipped surrogates as described by Schulman et al. (2017). The method also connects to the coherent soft imitation learning framework [Watson et al., 2024], which shows theoretical convergence to the expert under adequate policy capacity and data coverage.

A plausible implication is that, unlike behavior cloning, the on-policy self-alignment trajectory incentivizes continual adaptation and exploration, especially beneficial for out-of-distribution generalization.

4. Empirical Performance

The SRPPO framework is benchmarked using TULU-v2-mix high-quality demonstrations for SFT and UltraFeedback prompts for PPO exploration. Evaluation tasks include IFEval (instruction following), GSM8k (mathematical reasoning), GPQA (graduate-level QA, via CoT and direct), and AlpacaEval (conversational win-rate).

In the minimum-overlap setting on Mistral-7B and LLAMA3-8B, SRPPO consistently outperforms both SFT (even extended SFT) and prior policy-gradient methods such as SPIN and PPO with external RMs. Notably, SRPPO yields average score gains of 2–4 points on major metrics, especially on generalization- and data-efficiency-sensitive benchmarks.

| Method | IFEval L/S | GSM8k EM | GPQA CoT/NonCoT | AlpacaEval | Average |
| --- | --- | --- | --- | --- | --- |
| Pretrain | 30.6/29.4 | 37.3 | 12.5/27.2 | 0.07/0.12 | 21.8 |
| SFT (2ep) | 42.5/40.5 | 46.5 | 23.9/26.3 | 8.95/4.60 | 30.0 |
| SPIN | 45.1/38.7 | 43.0 | 19.9/26.6 | 5.81/4.29 | 28.3 |
| SRPPO | 47.6/41.4 | 46.9 | 24.3/26.6 | 12.5/13.2 | 32.4 |

Generalization is especially pronounced in out-of-domain tasks (mathematics, complex instructions), with the model maintaining robustness to domain shift even when further SFT is performed on unrelated data. SRPPO’s reward, derived from model improvement rather than external annotation, leads to consistently larger performance gains than PPO using external preference RMs.

5. Practical Implementation Guidance

Short SFT (1–2 epochs) is recommended, serving primarily to set the alignment direction for the RL phase and mitigating SFT overfitting. The reward should be applied at the sequence (EOS) level to prevent unwarranted length incentives, and a large unlabeled prompt set is critical for effective on-policy exploration. Very low actor learning rates and standard PPO buffer sizes are effective, with optional KL regularization stabilizing the fine-tuning trajectory.

Implementation in standard open-source PPO libraries (e.g., OpenRLHF) involves substituting the external reward term with $r(x, y) = \log\pi_{\text{sft}}(y \mid x) - \log\pi_{\text{pt}}(y \mid x)$ and assigning the reward at the EOS token. Monitoring generation length and maintaining critic stability via buffer warmup are best practices.
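The substitution itself is a small change: wherever the library calls an external reward model on a (prompt, response) pair, a self-reward function built from the two policies can be dropped in. The sketch below is hypothetical glue code, not an actual OpenRLHF API; `sft_logprob` and `pt_logprob` are assumed callables that return the sequence log-probability $\log \pi(y \mid x)$ under each frozen policy.

```python
# Hypothetical reward-hook sketch: replace an external reward-model call
# with the coherent log-policy-ratio reward. The two arguments are assumed
# callables returning log pi(y|x) for a prompt/response pair (e.g. thin
# wrappers around forward passes of the frozen SFT and pretrained models).

def make_self_reward(sft_logprob, pt_logprob):
    def reward_fn(prompt, response):
        # r(x, y) = log pi_sft(y|x) - log pi_pt(y|x)
        return sft_logprob(prompt, response) - pt_logprob(prompt, response)
    return reward_fn

# Toy stand-ins for the two policies' sequence log-probabilities.
reward_fn = make_self_reward(lambda x, y: -2.0, lambda x, y: -3.5)
r = reward_fn("prompt", "response")
```

In a real pipeline the returned scalar would then be placed at the EOS position of the rollout's per-token reward tensor, as described above.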

6. Relation to Broader Self-Rewarding PPO Variants

Beyond the LLM-focused SRPPO, self-rewarding mechanisms appear in other adaptations of the PPO framework, such as PPO-BR (Rahman, 23 May 2025). PPO-BR implements bidirectional regularization of the PPO trust region via dual policy-derived signals: entropy-driven expansion (enhancing exploration when uncertainty is high) and reward-guided contraction (improving stability when reward improvement slows). The adaptive clipping parameter $\epsilon_t$ is governed by both exploration and convergence cues, enabling a "self-rewarding" trust region. PPO-BR achieves substantial convergence acceleration and variance reduction on control benchmarks and is positioned as broadly applicable—across both RL environments and LLM alignment—due to its lightweight implementation and theoretical guarantees. However, unlike SRPPO, PPO-BR's "self-rewarding" refers to the adaptive learning dynamics, not an explicit reward for actions.

SRPPO is distinct from (and complementary to) prior approaches such as SPIN (self-play DPO) and PPO with external learned reward models. While those methods either utilize groupwise ranking or external preference annotation, SRPPO’s coherent reward enables purely demonstration-based, annotation-free alignment. Compared to GRPO, which operates on group relative policy optimization and does not incorporate entropy or phase-aware adaptation, SRPPO and PPO-BR introduce self-derived feedback loops that foster sharper exploration-exploitation balance. Theoretical and empirical analysis suggests that defining feedback relative to model improvement, rather than external signal, facilitates more robust and generalizable policy refinement in both language and control domains (Zhang et al., 24 Oct 2025, Rahman, 23 May 2025).
