Self-Rewarding PPO for LLM Alignment
- The paper introduces a self-rewarding mechanism that replaces external feedback with a coherent log-policy ratio reward derived from the difference between SFT and pretrained models.
- The algorithm follows a three-stage process—supervised fine-tuning, EOS-based coherent reward computation, and on-policy PPO updates—to enhance data efficiency and generalization.
- Empirical results demonstrate that SRPPO improves performance by 2–4 points on benchmarks and maintains robust alignment in data-scarce, out-of-domain settings.
Self-Rewarding Proximal Policy Optimization (SRPPO) denotes a class of reinforcement learning fine-tuning strategies for LLMs in which the reward function is generated directly from the model’s own refinement trajectory, rather than from external preference annotation or independently trained reward models. The prototypical SRPPO algorithm, as described in "Self-Rewarding PPO: Aligning LLMs with Demonstrations Only," leverages a mathematically coherent reward based on the log policy ratio between an SFT (supervised fine-tuned) model and its pretrained base, enabling on-policy optimization from demonstration data alone and enhancing generalization, data efficiency, and robustness without reliance on human-labeled preferences (Zhang et al., 24 Oct 2025). By embedding the reward into the PPO objective as the alignment direction defined by SFT, SRPPO achieves superior alignment of LLMs in out-of-domain and data-scarce settings.
1. Mathematical Framework
The foundation of Self-Rewarding PPO is the coherent reward

$$ r(x, y) = \beta \log \frac{\pi_{\text{SFT}}(y \mid x)}{\pi_{\text{pre}}(y \mid x)}, $$

where $\pi_{\text{pre}}$ is the pretrained policy, $\pi_{\text{SFT}}$ is the SFT-refined policy, and $\beta$ is a scaling coefficient. For a sequence $y$ given prompt $x$, the reward is assigned only at the EOS token of each rollout, avoiding length-degeneration pathologies.
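The coherent reward reduces to a masked log-probability difference between the two frozen policies, summed over response tokens and emitted as a single scalar at EOS. A minimal sketch (the helper name and tensor layout are illustrative, not from the paper):

```python
import torch

def coherent_reward(sft_logp, pre_logp, resp_mask, beta=1.0):
    """Coherent reward r(x, y) = beta * sum_t [log pi_SFT - log pi_pre] over
    response tokens, returned as one scalar per rollout (assigned at EOS).
    sft_logp / pre_logp: per-token log-probs of the sampled tokens under each
    policy, shape (B, T). resp_mask: 1.0 on response tokens, 0.0 on the prompt."""
    return beta * ((sft_logp - pre_logp) * resp_mask).sum(dim=-1)
```

In practice the per-token log-probs come from a forward pass of the frozen SFT and pretrained models over the sampled rollout; only the response span (after the prompt) contributes.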
SRPPO substitutes this reward for the external objective in the PPO actor loss

$$ \mathcal{L}_{\text{actor}}(\theta) = -\,\mathbb{E}_t\!\left[ \min\!\big( \rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right], $$

with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage $\hat{A}_t$ computed by GAE with parameters $(\gamma, \lambda)$. The critic is optimized with a standard MSE loss against the cumulative self-derived reward. This formulation retains the operational simplicity and monotonic-improvement properties of PPO.
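The clipped surrogate and GAE are standard PPO components; a compact, self-contained sketch of both (function names are illustrative):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (1-D tensors,
    no bootstrap past the final step, as in episodic LLM generation)."""
    T = rewards.size(0)
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual
        last = delta + gamma * lam * last                  # exponentially weighted sum
        adv[t] = last
    return adv

def ppo_actor_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate: -E[min(rho * A, clip(rho, 1-eps, 1+eps) * A)]."""
    rho = torch.exp(logp_new - logp_old)
    unclipped = rho * adv
    clipped = torch.clamp(rho, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

The only SRPPO-specific change relative to vanilla PPO is the source of `rewards`: the coherent log-ratio scalar placed at the EOS step rather than an external reward-model score.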
2. Algorithmic Implementation
SRPPO proceeds in three stages:
- Supervised Fine-Tuning (SFT): The base model is fine-tuned with cross-entropy on a demonstration set $\mathcal{D}$ for 1–2 epochs, yielding the reference policy $\pi_{\text{SFT}}$.
- Coherent Reward Computation: For any sample $(x, y)$, compute $r(x, y)$ from the log-policy ratio $\log \pi_{\text{SFT}}(y \mid x) - \log \pi_{\text{pre}}(y \mid x)$. This becomes the only reward signal for RL.
- On-Policy PPO Fine-Tuning: Initialize the PPO actor and critic from SFT. At each iteration, sample prompts from a diverse, unlabeled pool $\mathcal{P}$, collect responses, assign $r(x, y)$ at the EOS token, compute advantages, and update both actor and critic following standard PPO update steps.
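The on-policy stage can be sketched as a loop over hypothetical callables (none of these names come from the paper; they stand in for the rollout, reward, and update machinery of any PPO implementation):

```python
def srppo_iteration(sample_prompts, rollout, reward_at_eos, ppo_update):
    """One on-policy SRPPO iteration: sample unlabeled prompts, collect
    rollouts with the current actor, assign the coherent reward at EOS,
    then run a PPO actor/critic update on the batch."""
    prompts = sample_prompts()               # draw from the unlabeled pool
    batch = []
    for x in prompts:
        y = rollout(x)                       # generate a response with the actor
        r = reward_at_eos(x, y)              # coherent log-ratio reward, EOS only
        batch.append((x, y, r))
    return ppo_update(batch)                 # actor + critic update; returns stats
```

Because the reward callable only needs forward passes of two frozen models, no reward-model training or preference collection step appears anywhere in the loop.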
Recommended hyperparameters include an SFT batch size of 128 and 2 SFT epochs, a small PPO actor learning rate with a somewhat larger critic learning rate, standard PPO clipping, GAE advantage estimation, and optional KL regularization. Critic warmup buffers and EOS-only reward assignment are important for stability and length control.
| Stage | Main Action | Key Setting |
|---|---|---|
| SFT | Cross-entropy on demonstration set $\mathcal{D}$ | 1–2 epochs |
| Reward | Coherent log-ratio $r(x, y)$ | Assigned at EOS only |
| PPO | On-policy update over prompt pool $\mathcal{P}$ | Clipped surrogate, low actor LR |
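For concreteness, the stages above can be collected into a single configuration sketch. Only the batch size and SFT epoch count are stated in this article; every other value is a common PPO default, not a confirmed setting from the paper:

```python
# Hypothetical SRPPO configuration. Only sft.batch_size and sft.epochs are
# reported values; the PPO entries are widely used defaults, not the paper's.
srppo_config = {
    "sft": {"batch_size": 128, "epochs": 2},
    "ppo": {
        "clip_eps": 0.2,         # assumed standard PPO clipping
        "gae_lambda": 0.95,      # assumed common GAE setting
        "gae_gamma": 1.0,        # episodic LLM rollouts often use gamma = 1
        "kl_coef": 0.0,          # optional KL regularization, off by default
        "reward_position": "eos_only",
    },
}
```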
3. Theoretical Properties
The log-ratio reward provides a sensitive, task-adaptive feedback that reflects how much the SFT policy improves over pretraining on a per-sample basis. Because the reward is derived from the same model family under refinement, it rapidly reflects distribution shifts and avoids static overfitting.
SRPPO does not supply a formal convergence proof specific to the self-reward setup, but inherits the monotonic improvement guarantees of PPO under clipped surrogates as described by Schulman et al. (2017). The method also connects to the coherent soft imitation learning framework [Watson et al., 2024], which shows theoretical convergence to the expert under adequate policy capacity and data coverage.
A plausible implication is that, unlike behavior cloning, the on-policy self-alignment trajectory incentivizes continual adaptation and exploration, especially beneficial for out-of-distribution generalization.
4. Empirical Performance
The SRPPO framework is benchmarked using TULU-v2-mix high-quality demonstrations for SFT and UltraFeedback prompts for PPO exploration. Evaluation tasks include IFEval (instruction following), GSM8k (mathematical reasoning), GPQA (graduate-level QA, via CoT and direct), and AlpacaEval (conversational win-rate).
In the minimum-overlap setting on Mistral-7B and LLAMA3-8B, SRPPO consistently outperforms both SFT (even extended SFT) and prior policy-gradient methods such as SPIN and PPO with external RMs. Notably, SRPPO yields average score gains of 2–4 points on major metrics, especially on generalization- and data-efficiency-sensitive benchmarks.
| Method | IFEval L/S | GSM8k EM | GPQA CoT/NonCoT | AlpacaEval | Average |
|---|---|---|---|---|---|
| Pretrain | 30.6/29.4 | 37.3 | 12.5/27.2 | 0.07/0.12 | 21.8 |
| SFT (2ep) | 42.5/40.5 | 46.5 | 23.9/26.3 | 8.95/4.60 | 30.0 |
| SPIN | 45.1/38.7 | 43.0 | 19.9/26.6 | 5.81/4.29 | 28.3 |
| SRPPO | 47.6/41.4 | 46.9 | 24.3/26.6 | 12.5/13.2 | 32.4 |
Generalization is especially pronounced in out-of-domain tasks (mathematics, complex instructions), with the model maintaining robustness to domain shift even when further SFT is performed on unrelated data. SRPPO’s reward, derived from model improvement rather than external annotation, leads to consistently larger performance gains than PPO using external preference RMs.
5. Practical Implementation Guidance
Short SFT (1–2 epochs) is recommended, serving primarily to set the alignment direction for the RL phase and mitigating SFT overfitting. The reward should be applied at the sequence (EOS) level to prevent unwarranted length incentives, and a large unlabeled prompt set is critical for effective on-policy exploration. Very low actor learning rates and standard PPO buffer sizes are effective, with optional KL regularization stabilizing the fine-tuning trajectory.
Implementation in standard open-source PPO libraries (e.g., OpenRLHF) involves substituting the external reward term with the coherent log-ratio reward $r(x, y)$ and adjusting reward assignment to the EOS token. Monitoring generation length and maintaining critic stability via buffer warmup are best practices.
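The EOS-only assignment mirrors how external reward-model scores are typically placed in PPO pipelines: a single scalar at the final token, zeros elsewhere. A minimal sketch of that placement (a generic helper, not any library's actual API):

```python
import torch

def place_reward_at_eos(seq_reward, seq_len, eos_pos):
    """Build the per-token reward vector for one rollout: the scalar coherent
    reward at the EOS position, zero at every other step. This is the tensor
    a PPO trainer consumes in place of per-token reward-model scores."""
    per_token = torch.zeros(seq_len)
    per_token[eos_pos] = seq_reward
    return per_token
```

Keeping all reward mass at EOS removes any per-token incentive to lengthen generations, which is why monitoring generation length remains a useful sanity check rather than a required control.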
6. Relation to Broader Self-Rewarding PPO Variants
Beyond the LLM-focused SRPPO, self-rewarding mechanisms appear in other adaptations of the PPO framework, such as PPO-BR (Rahman, 23 May 2025). PPO-BR implements bidirectional regularization of the PPO trust region via dual policy-derived signals: entropy-driven expansion (enhancing exploration when uncertainty is high) and reward-guided contraction (improving stability when reward improvement slows). The adaptive clipping parameter is governed by both exploration and convergence cues, enabling a "self-rewarding" trust region. PPO-BR achieves substantial convergence acceleration and variance reduction on control benchmarks and is positioned as broadly applicable—across both RL environments and LLM alignment—due to its lightweight implementation and theoretical guarantees. However, unlike SRPPO, PPO-BR’s self-rewarding refers to the adaptive learning dynamics, not the explicit reward for actions.
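PPO-BR's bidirectional trust region can be illustrated with a small adaptive-clipping rule. The update direction matches the description above (entropy widens the clip range, stalled reward shrinks it), but the functional form and all constants here are illustrative, not the paper's:

```python
def adaptive_clip_eps(base_eps, entropy, entropy_ref, reward_delta,
                      k_explore=0.5, shrink=0.5, stall_threshold=0.01,
                      eps_min=0.05, eps_max=0.4):
    """PPO-BR-style bidirectional trust region (sketch, illustrative constants):
    entropy above a running reference widens the clip range (exploration);
    reward improvement below a stall threshold shrinks it (stability)."""
    # Entropy-driven expansion: more uncertainty -> wider trust region.
    eps = base_eps * (1.0 + k_explore * max(0.0, entropy - entropy_ref))
    # Reward-guided contraction: stalled improvement -> tighter trust region.
    if reward_delta < stall_threshold:
        eps *= shrink
    return min(max(eps, eps_min), eps_max)
```

The resulting `eps` would replace the fixed clipping constant in the standard PPO surrogate, coupling the trust-region width to the learning dynamics rather than to the reward signal itself.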
7. Comparison with Related Methods
SRPPO is distinct from (and complementary to) prior approaches such as SPIN (self-play DPO) and PPO with external learned reward models. While those methods either utilize groupwise ranking or external preference annotation, SRPPO’s coherent reward enables purely demonstration-based, annotation-free alignment. Compared to GRPO, which operates on group relative policy optimization and does not incorporate entropy or phase-aware adaptation, SRPPO and PPO-BR introduce self-derived feedback loops that foster sharper exploration-exploitation balance. Theoretical and empirical analysis suggests that defining feedback relative to model improvement, rather than external signal, facilitates more robust and generalizable policy refinement in both language and control domains (Zhang et al., 24 Oct 2025, Rahman, 23 May 2025).