Proximal Sequence Policy Optimization (PSPO)

Updated 22 January 2026
  • PSPO is a reinforcement learning framework that treats whole response sequences as atomic actions, improving reward attribution and stability.
  • It aligns policy and value updates to the sequence level, mitigating noise from token-level credit assignment in complex, multi-turn tasks.
  • Empirical evaluation shows PSPO achieves higher sample efficiency and robust performance in applications like academic paper retrieval and conditional sequence generation.

Proximal Sequence Policy Optimization (PSPO) refers to a class of reinforcement learning (RL) algorithms designed for robust, stable optimization in sequence generation and multi-turn agentic tasks. PSPO enhances the standard Proximal Policy Optimization (PPO) framework by aligning the granularity of optimization with the structure of the agent-environment interaction, ensuring that credit assignment, policy updates, and reward signals respect the atomic units of complex sequential tasks. PSPO has been empirically validated in domains such as academic paper search with the PaperScout agent and in conditional sequence generation, demonstrating improved sample efficiency, stability, and final-task performance compared to both vanilla policy gradient and token-level PPO baselines (Pan et al., 15 Jan 2026, Tuan et al., 2018).

1. Motivation and Background

Multi-turn, agentic tasks, such as LLM-driven academic paper search or sophisticated dialogue systems, intrinsically decompose into discrete interaction turns, each producing complex, high-dimensional action sequences (e.g., full responses or tool calls) and yielding sparse, delayed rewards. Standard reinforcement learning strategies, including token-level PPO, are mismatched for such domains: they distribute sequence-level rewards across individual tokens, leading to noisy credit assignment and instability in value learning. Outcome-level RL methods, such as GRPO or GSPO, operate at the trajectory level, disregarding valuable intermediate process signals and exacerbating credit assignment challenges (Pan et al., 15 Jan 2026).

PSPO addresses this granularity mismatch by considering entire sequences (e.g., an agent’s full response at each turn) as atomic actions and aligning both policy and value updates to the sequence level. This process-aware optimization enables more precise reward attribution, better utilization of process signals, and improved synergy between RL optimization and the task’s structure (Pan et al., 15 Jan 2026).

2. Formal PSPO Objectives and Algorithm

PSPO is formulated within a partially observable Markov decision process (POMDP), where, at each turn $t$, the agent observes state $x_t$, outputs an action $y_t$ (a full sequence), executes any necessary tool calls, and receives a turn-level reward $r_t$. Both the policy $\pi_\theta$ and value $V_\phi$ networks are updated at the sequence level using generalized advantage estimation (GAE).

Rewards and Returns:

  • For tasks such as academic paper retrieval, the reward is defined as:

$$r_t = \sum_{p \in \text{top-}k(V_t)} \rho(p) - \eta \sum_{c \in C_t} \mathbf{1}[c \text{ was used before}]$$

where $\rho(p) \in [0, 1]$ is a relevance score and $\eta$ controls the penalty for redundant tool usage.

  • Returns and advantages are computed as:

$$R_t = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(x_{t+1}) - V_\phi(x_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}$$
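As a concrete sketch, the returns and GAE advantages above can be computed per trajectory as follows. This is a minimal NumPy illustration, not code from the paper; the function name `sequence_gae` and the convention that `values` carries one extra bootstrap entry $V_\phi(x_T)$ at the end are assumptions.

```python
import numpy as np

def sequence_gae(rewards, values, gamma=0.99, lam=0.95):
    """Discounted returns R_t and GAE advantages A_t for one trajectory,
    treating each turn's full response as a single action.
    `values` holds V_phi(x_t) for t = 0..T (one bootstrap value at the end)."""
    T = len(rewards)
    returns = np.zeros(T)
    advantages = np.zeros(T)
    ret, gae = 0.0, 0.0
    for t in reversed(range(T)):
        ret = rewards[t] + gamma * ret          # R_t = sum_l gamma^l r_{t+l}
        returns[t] = ret
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae         # A_t = sum_l (gamma*lam)^l delta_{t+l}
        advantages[t] = gae
    return returns, advantages
```

Computing both quantities in a single backward pass keeps the per-turn (rather than per-token) granularity explicit: one reward, one value, and one advantage per turn.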

Surrogate Objectives:

  • Actor loss: The sequence-level importance ratio is

$$w_t(\theta) = \frac{\pi_\theta(y_t \mid x_t)}{\pi_{\theta_\text{old}}(y_t \mid x_t)}$$

yielding the clipped surrogate objective:

$$L_\text{actor}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\left[\min\left(w_t(\theta)\hat{A}_t,\ \mathrm{clip}(w_t(\theta),\, 1-\epsilon_\text{low},\, 1+\epsilon_\text{high})\,\hat{A}_t\right)\right]$$

  • Critic loss: Using normalized returns across the batch,

$$L_\text{critic}(\phi) = \mathbb{E}_{\tau \sim \pi}\left[\left(V_\phi(x_t) - \mathrm{Norm}(R_t)\right)^2\right]$$

  • PSPO uses asymmetric clipping bounds $[\epsilon_\text{low}, \epsilon_\text{high}]$ for enhanced exploration and stability.
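A minimal sketch of the asymmetric clipped actor objective, assuming each sequence's log-probability has already been summed over its tokens so that the exponentiated difference is the sequence-level ratio $w_t$. The function name and the example $\epsilon$ values are illustrative, not taken from the paper.

```python
import numpy as np

def pspo_actor_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Sequence-level clipped surrogate (returned negated, as a loss to
    minimize). logp_* are summed token log-probs of each whole response
    y_t, so exp(logp_new - logp_old) is the sequence importance ratio w_t."""
    w = np.exp(logp_new - logp_old)                    # w_t(theta)
    clipped = np.clip(w, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(w * adv, clipped * adv)     # pessimistic bound
    return float(-surrogate.mean())
```

Note the asymmetry: the upward clip $1+\epsilon_\text{high}$ is looser than the downward clip $1-\epsilon_\text{low}$, which permits slightly larger probability increases on positively-advantaged sequences and thus encourages exploration.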

Algorithmic Steps:

  1. Collect $N$ trajectories per epoch, each with $T$ turns.
  2. At each turn, generate response $y_t$, execute tool calls, record $(x_t, y_t, r_t, V_t)$.
  3. Compute returns $R_t$, temporal differences $\delta_t$, and advantages $\hat{A}_t$.
  4. Normalize $R_t$.
  5. Optimize $L_\text{critic}$ and $L_\text{actor}$ (optionally with value pre-training, gradient clipping, and learning-rate scheduling).

This approach is process-aware, with rewards and updates aligned to sequence-level agent actions, unlike token-level PPO (Pan et al., 15 Jan 2026).
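The critic target with normalized returns can be sketched as follows. This is a hedged illustration: $\mathrm{Norm}$ is taken here as per-batch standardization, which is one simple reading of the normalization step; the paper's exact normalization scheme may differ.

```python
import numpy as np

def pspo_critic_loss(values, returns, eps=1e-8):
    """Squared error between V_phi(x_t) and batch-normalized returns
    Norm(R_t), computed over a flat batch of turns."""
    norm_ret = (returns - returns.mean()) / (returns.std() + eps)
    return float(np.mean((values - norm_ret) ** 2))
```

Normalizing the regression target keeps the critic's output scale stable across queries whose raw returns differ by orders of magnitude.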

3. PSPO in Sequence Generation: Standard and Dynamic Variants

Proximal Sequence Policy Optimization can also denote an adaptation of PPO for sequence-to-sequence (seq2seq) RL problems, as presented in prior work (Tuan et al., 2018). In standard PSPO, the policy $\pi_\theta$ selects tokens incrementally, but reward assignment and policy updates can occur at the token or sequence level.

Standard PPO for Sequences:

  • The token-level importance ratio at time $t$ is $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$.
  • Main surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_\text{old}}}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right]$$

PPO-dynamic Variant:

  • Clipping bounds are adapted based on action probability, expanding the trust region for less-probable actions to encourage exploration:

$$\beta(\cdot) = \min\!\left(\beta_1,\ \beta_2 \sqrt{1/\pi_{\theta_\text{old}}(a_t \mid s_t) - 1}\right), \qquad \alpha(\cdot) = \min\!\left(\alpha_1,\ \alpha_2 \sqrt{1/\pi_{\theta_\text{old}}(a_t \mid s_t) - 1}\right)$$

  • Dynamic surrogate objective:

$$L^{\text{PPO-dyn}}(\theta) = \mathbb{E}_{(s_t, a_t)}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \beta(\cdot),\, 1 + \alpha(\cdot))\, A_t\right)\right]$$

This approach accelerates convergence by loosening constraints where the old policy is uncertain and enforces stricter updates where it is more confident (Tuan et al., 2018).
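The probability-dependent bounds can be sketched as follows. The hyperparameter defaults ($\alpha_1 = \beta_1 = 0.3$, $\alpha_2 = \beta_2 = 1.0$) are illustrative example settings, not values prescribed by the paper.

```python
import numpy as np

def dynamic_clip_bounds(p_old, a1=0.3, b1=0.3, a2=1.0, b2=1.0):
    """Clip range (1 - beta, 1 + alpha) from the PPO-dynamic objective:
    actions that are rare under the old policy (small p_old) get a wider
    trust region; near-certain actions get almost no slack."""
    spread = np.sqrt(1.0 / p_old - 1.0)   # grows as p_old -> 0
    beta = np.minimum(b1, b2 * spread)
    alpha = np.minimum(a1, a2 * spread)
    return 1.0 - beta, 1.0 + alpha
```

For example, an action with $\pi_{\theta_\text{old}} = 0.5$ yields the full range $(0.7, 1.3)$ under these defaults, while an action with probability near 1 collapses the range to almost exactly $(1, 1)$.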

4. Theoretical Properties and Empirical Behavior

PSPO’s advantages stem from improved alignment of optimization granularity, reward attribution, and stability mechanisms:

  • Improved Credit Assignment: Reward signals are attributed to whole responses, mitigating the signal dilution that arises in token-level reward distribution.
  • Process-Awareness: Intermediate rewards at each sequence step enable richer and more informative value updates, in contrast to outcome-only optimization that disregards pivotal process steps.
  • Variance Reduction and Stability: Top-$k$ reward aggregation, running mean-variance return normalization, and asymmetric or dynamic clipping produce more stable actor gradients and diminish variance across heterogeneous query difficulties.
  • Empirical Convergence: In academic paper retrieval and synthetic sequence-generation, PSPO converges faster and to better final returns compared to both token-level PPO and outcome-level methods like GSPO (Pan et al., 15 Jan 2026, Tuan et al., 2018).
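The running mean-variance normalization mentioned above can be sketched with a Welford-style accumulator. This is an illustrative implementation under the assumption that returns are normalized against statistics accumulated over training, not code from either paper.

```python
class RunningNorm:
    """Running mean/variance tracker (Welford's algorithm) for
    normalizing returns across heterogeneous query difficulties."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        var = self.m2 / max(self.n - 1, 1)   # sample variance
        return (x - self.mean) / (var ** 0.5 + eps)
```

A running normalizer, unlike per-batch standardization, keeps the critic target consistent across epochs even as batch composition shifts.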

5. Experimental Evaluation

Comprehensive empirical studies validate PSPO and its variants on both synthetic and real-world tasks.

PaperScout Experiments (Pan et al., 15 Jan 2026):

  • Benchmarks: AutoScholarQuery (33k synthetic queries), RealScholarQuery (50 expert-curated queries).
  • Retrieval Setup: Tool-based search/expand on Milvus/ar5iv during training, Google Search for test evaluation.
  • Baselines: Google Search, Google Scholar, fixed-workflow RL (PaSa, SPAR), PPO, GSPO (on Qwen3-4B).
  • Metrics: Precision, Recall, F1; LLM-Score (0–3 by three large models).
  • Key Results:
    • PSPO-trained PaperScout achieves a recall of 0.574 and an LLM-score of 2.576, outperforming GSPO (recall 0.557, LLM-score 2.510) and PPO (recall 0.537, LLM-score 2.417).
    • PSPO displays sample efficiency, robust retrieval with limited tool calls, and stable optimization (lower actor gradient norms, lower critic loss).
  • Ablations: Reverting to token-level PPO or outcome-level GSPO degrades stability and final performance; intermediate process-aware rewards are critical.

Sequence Generation Studies (Tuan et al., 2018):

  • Synthetic Counting:
    • PPO-dynamic achieves 98.62% precision (vs. 98.42% PPO; REINFORCE slightly higher but with mode collapse).
    • PPO-dynamic diversifies outputs and accelerates convergence relative to fixed-ε PPO.
  • Chatbot (OpenSubtitles):
    • PPO-dynamic attains the highest BLEU-2 (14.73; vs. 14.12 PPO, 14.29 REINFORCE) and rapid convergence.
  • Stability: Both standard and dynamic PSPO outperform REINFORCE in variance reduction, convergence speed, and output diversity.

6. Practical Implementation Guidelines

Recommended practices for PSPO deployment in sequence generation and agentic tasks include:

  • Pretraining policies by maximum-likelihood to stabilize early-stage learning.
  • Tuning clipping parameters ($\epsilon$ for PPO, $\alpha_2 = \beta_2$ for PPO-dynamic) based on validation performance; values of $\epsilon \in [0.1, 0.2]$ and $\alpha_2 \approx 1.0$ have been validated.
  • Employing learned value baselines $b(s_t)$ to reduce advantage-estimation variance.
  • Monitoring empirical KL-divergence between policy iterations to prevent premature policy drift.
  • In adversarial RL (e.g., SeqGAN), replacing vanilla policy gradient with PSPO enhances GAN training stability (Tuan et al., 2018).
  • For multi-turn, process-driven domains, aligning reward granularity and value estimation with the actual agentic turns is essential for both convergence and final-task performance (Pan et al., 15 Jan 2026).
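For the KL-monitoring recommendation above, a simple sampled estimator suffices in practice. This sketch assumes access to sequence log-probabilities under both policies for samples drawn from the old policy; the estimator choice and function name are illustrative.

```python
import numpy as np

def approx_kl(logp_old, logp_new):
    """Monte Carlo estimate of KL(pi_old || pi_new) from log-probs of
    sequences sampled under the old policy: E_old[log p_old - log p_new]."""
    return float(np.mean(logp_old - logp_new))
```

A sustained rise in this estimate across updates signals policy drift; a common response is to shrink the clip range or stop the current epoch's inner optimization early.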

7. Implications and Distinctions

PSPO unifies and extends PPO-based RL for both discrete sequence generation and multi-turn agentic settings by emphasizing process-aware, sequence-level optimization. It mitigates the main deficiencies of token-level optimizers (credit assignment, instability under sparse rewards) and outcome-level optimizers (discarded process signals), enabling stable, sample-efficient, high-performance RL across diverse, high-dimensional sequence tasks.

A plausible implication is that as agentic AI tasks increase in complexity, frameworks explicitly accounting for process structure and reward granularity, such as PSPO, are likely to underlie state-of-the-art autonomous agents in information retrieval, dialogue, and tool-driven reasoning (Pan et al., 15 Jan 2026).
