Proximal Sequence Policy Optimization (PSPO)
- PSPO is a reinforcement learning framework that treats whole response sequences as atomic actions, improving reward attribution and stability.
- It aligns policy and value updates to the sequence level, mitigating noise from token-level credit assignment in complex, multi-turn tasks.
- Empirical evaluation shows PSPO achieves higher sample efficiency and robust performance in applications like academic paper retrieval and conditional sequence generation.
Proximal Sequence Policy Optimization (PSPO) refers to a class of reinforcement learning (RL) algorithms designed for robust, stable optimization in sequence generation and multi-turn agentic tasks. PSPO enhances the standard Proximal Policy Optimization (PPO) framework by aligning the granularity of optimization with the structure of the agent-environment interaction, ensuring that credit assignment, policy updates, and reward signals respect the atomic units of complex sequential tasks. PSPO has been empirically validated in domains such as academic paper search with the PaperScout agent and in conditional sequence generation, demonstrating improved sample efficiency, stability, and final-task performance compared to both vanilla policy gradient and token-level PPO baselines (Pan et al., 15 Jan 2026, Tuan et al., 2018).
1. Motivation and Background
Multi-turn, agentic tasks, such as LLM-driven academic paper search or sophisticated dialogue systems, intrinsically decompose into discrete interaction turns, each producing complex, high-dimensional action sequences (e.g., full responses or tool calls) and yielding sparse, delayed rewards. Standard reinforcement learning strategies, including token-level PPO, are mismatched for such domains: they distribute sequence-level rewards across individual tokens, leading to noisy credit assignment and instability in value learning. Outcome-level RL methods, such as GRPO or GSPO, operate at the trajectory level, disregarding valuable intermediate process signals and exacerbating credit assignment challenges (Pan et al., 15 Jan 2026).
PSPO addresses this granularity mismatch by considering entire sequences (e.g., an agent’s full response at each turn) as atomic actions and aligning both policy and value updates to the sequence level. This process-aware optimization enables more precise reward attribution, better utilization of process signals, and improved synergy between RL optimization and the task’s structure (Pan et al., 15 Jan 2026).
2. Formal PSPO Objectives and Algorithm
PSPO is formulated within a partially observable Markov decision process (POMDP), where, at each turn $t$, the agent observes state $s_t$, outputs an action $a_t$ (a full sequence), executes any necessary tool calls, and receives a turn-level reward $r_t$. Both policy and value networks are updated at the sequence level using generalized advantage estimation (GAE).
Rewards and Returns:
- For tasks such as academic paper retrieval, the turn-level reward takes the form $r_t = s_t - \alpha \, c_t$,
where $s_t$ is a relevance score, $c_t$ counts redundant tool calls, and $\alpha$ controls the penalty for redundant tool usage.
- Returns and advantages are computed with turn-level GAE: $R_t = \sum_{k \ge 0} \gamma^k r_{t+k}$, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and $A_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}$.
Surrogate Objectives:
- Actor loss: The sequence-level importance ratio is $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$, where $a_t$ is the entire turn-$t$ response,
yielding the clipped surrogate objective $L^{\text{actor}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}) A_t\right)\right]$.
- Critic loss: Using returns $\hat{R}_t$ normalized across the batch, $L^{\text{critic}}(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right]$.
- PSPO uses asymmetric clipping for enhanced exploration and stability.
Algorithmic Steps:
- Collect $N$ trajectories per epoch, each with up to $T$ turns.
- At each turn $t$, generate response $a_t$, execute tool calls, record $r_t$.
- Compute returns $R_t$, temporal differences $\delta_t$, and advantages $A_t$.
- Normalize returns and advantages across the batch.
- Optimize $\pi_\theta$ and $V_\phi$ (optionally with value pre-training, gradient clipping, learning-rate scheduling).
This approach is process-aware, with rewards and updates aligned to sequence-level agent actions, unlike token-level PPO (Pan et al., 15 Jan 2026).
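The sequence-level update in the steps above can be sketched in a minimal, framework-agnostic form. The GAE coefficients, the asymmetric bounds `eps_low`/`eps_high`, and all function names here are illustrative assumptions, not the paper's implementation:

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Turn-level GAE: one reward and one value estimate per agent turn,
    not per token. Returns (advantages, returns)."""
    T = len(rewards)
    deltas = [
        rewards[t] + gamma * (values[t + 1] if t + 1 < T else 0.0) - values[t]
        for t in range(T)
    ]
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):  # backward accumulation of (gamma*lam)-discounted deltas
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def pspo_actor_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Sequence-level clipped surrogate with asymmetric clipping.
    logp_new/logp_old are log-probabilities of the WHOLE turn response
    (sum of its token log-probs), so each ratio is a per-sequence quantity."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)  # negated: minimized by the optimizer
```

In a real training loop the log-probabilities would come from the policy network and the loss would be backpropagated; plain floats are used here to keep the sequence-level mechanics visible.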
3. PSPO in Sequence Generation: Standard and Dynamic Variants
Proximal Sequence Policy Optimization can also denote an adaptation of PPO for sequence-to-sequence (seq2seq) RL problems, as presented in prior work (Tuan et al., 2018). In standard PSPO, the policy selects tokens incrementally, but reward assignment and policy updates can be at the token or sequence level.
Standard PPO for Sequences:
- The token-level importance ratio at time $t$ is $r_t(\theta) = \pi_\theta(y_t \mid y_{<t}, x) / \pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)$.
- Main surrogate objective: $L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon) A_t\right)\right]$.
PPO-dynamic Variant:
- Clipping bounds are adapted based on action probability, expanding the trust region $[1-\epsilon_t,\ 1+\epsilon_t]$ for less-probable actions to encourage exploration.
- Dynamic surrogate objective: $L^{\text{dyn}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon_t,\, 1+\epsilon_t) A_t\right)\right]$, where $\epsilon_t$ grows as $\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)$ decreases.
This approach accelerates convergence by loosening constraints where the old policy is uncertain and enforces stricter updates where it is more confident (Tuan et al., 2018).
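The dynamic-clipping idea can be illustrated with a small sketch. The square-root schedule and the `beta`/`eps_max` parameters below are assumptions chosen to show the principle (bounds widening as the old policy's probability shrinks), not the exact bounds derived by Tuan et al. (2018):

```python
import math

def dynamic_bounds(p_old, beta=0.1, eps_max=0.5):
    """Illustrative action-dependent clip range: the trust region widens
    as the old policy's probability p_old shrinks. The sqrt(1/p - 1)
    schedule is an assumed stand-in for the paper's derived bounds."""
    eps = min(eps_max, beta * math.sqrt(1.0 / p_old - 1.0))
    return 1.0 - eps, 1.0 + eps

def dynamic_clip_term(ratio, advantage, p_old, beta=0.1):
    """One term of the dynamic surrogate: min of unclipped and clipped parts."""
    low, high = dynamic_bounds(p_old, beta)
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)
```

For a confident old policy (`p_old = 0.5`, `beta = 0.1`) the range is a tight `(0.9, 1.1)`; for `p_old = 0.01` it expands to the `eps_max` cap, permitting larger updates on rarely chosen tokens.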
4. Theoretical Properties and Empirical Behavior
PSPO’s advantages stem from improved alignment of optimization granularity, reward attribution, and stability mechanisms:
- Improved Credit Assignment: Reward signals are attributed to whole responses, mitigating the signal dilution that arises in token-level reward distribution.
- Process-Awareness: Intermediate rewards at each sequence step enable richer and more informative value updates, in contrast to outcome-only optimization that disregards pivotal process steps.
- Variance Reduction and Stability: Top-$k$ reward aggregation, running mean-variance return normalization, and asymmetric or dynamic clipping produce more stable actor gradients and diminish variance across heterogeneous query difficulties.
- Empirical Convergence: In academic paper retrieval and synthetic sequence-generation, PSPO converges faster and to better final returns compared to both token-level PPO and outcome-level methods like GSPO (Pan et al., 15 Jan 2026, Tuan et al., 2018).
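The running mean-variance return normalization mentioned above can be implemented with Welford's online algorithm; this sketch is illustrative and not tied to either paper's codebase:

```python
class RunningNorm:
    """Welford's online mean/variance, used here to normalize returns
    across batches so heterogeneous query difficulties do not inflate
    gradient variance."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.eps = eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        var = self.m2 / max(self.n - 1, 1)  # sample variance
        return var ** 0.5 + self.eps        # eps avoids division by zero

    def normalize(self, x):
        return (x - self.mean) / self.std()
```

Because the statistics are updated incrementally, the normalizer tracks the return distribution across epochs without storing past batches.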
5. Experimental Evaluation
Comprehensive empirical studies validate PSPO and its variants on both synthetic and real-world tasks.
PaperScout Experiments (Pan et al., 15 Jan 2026):
- Benchmarks: AutoScholarQuery (33k synthetic queries), RealScholarQuery (50 expert-curated queries).
- Retrieval Setup: Tool-based search/expand on Milvus/ar5iv during training, Google Search for test evaluation.
- Baselines: Google Search, Google Scholar, fixed-workflow RL (PaSa, SPAR), PPO, GSPO (on Qwen3–4B).
- Metrics: Precision, Recall, F1; LLM-Score (0–3 scale, judged by three large models).
- Key Results:
- PSPO-trained PaperScout achieves a recall of 0.574 and an LLM-score of 2.576, outperforming GSPO (recall 0.557, LLM-score 2.510) and PPO (recall 0.537, LLM-score 2.417).
- PSPO displays high sample efficiency, robust retrieval with limited tool calls, and stable optimization (lower actor gradient norms, lower critic loss).
- Ablations: Reverting to token-level PPO or outcome-level GSPO degrades stability and final performance; intermediate process-aware rewards are critical.
Sequence Generation Studies (Tuan et al., 2018):
- Synthetic Counting:
- PPO-dynamic achieves 98.62% precision (vs. 98.42% PPO; REINFORCE slightly higher but with mode collapse).
- PPO-dynamic diversifies outputs and accelerates convergence relative to fixed-ε PPO.
- Chatbot (OpenSubtitles):
- PPO-dynamic attains the highest BLEU-2 (14.73; vs. 14.12 PPO, 14.29 REINFORCE) and rapid convergence.
- Stability: Both standard and dynamic PSPO outperform REINFORCE in variance reduction, convergence speed, and output diversity.
6. Practical Implementation Guidelines
Recommended practices for PSPO deployment in sequence generation and agentic tasks include:
- Pretraining policies by maximum-likelihood to stabilize early-stage learning.
- Tuning the clipping parameters ($\epsilon$ for PPO, $\beta$ for PPO-dynamic) based on validation performance.
- Employing a learned value baseline $V_\phi(s)$ to reduce advantage-estimation variance.
- Monitoring empirical KL-divergence between policy iterations to prevent premature policy drift.
- In adversarial RL (e.g., SeqGAN), replacing vanilla policy gradient with PSPO enhances GAN training stability (Tuan et al., 2018).
- For multi-turn, process-driven domains, aligning reward granularity and value estimation with the actual agentic turns is essential for both convergence and final-task performance (Pan et al., 15 Jan 2026).
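The KL-divergence monitoring recommended above can be sketched as follows; the `kl_target` threshold and early-stop heuristic are illustrative assumptions in the spirit of common PPO implementations, not a prescription from either paper:

```python
def empirical_kl(logp_old, logp_new):
    """Monte-Carlo estimate of KL(pi_old || pi_new) from per-sample
    log-probabilities under both policies (samples drawn from pi_old)."""
    return sum(lo - ln for lo, ln in zip(logp_old, logp_new)) / len(logp_old)

def should_stop_update(logp_old, logp_new, kl_target=0.02, factor=1.5):
    """Early-stop heuristic: halt the current epoch of policy updates
    once the estimated KL exceeds factor * kl_target, preventing the
    premature policy drift discussed above."""
    return empirical_kl(logp_old, logp_new) > factor * kl_target
```

In practice the log-probabilities are recorded during rollout (old policy) and recomputed under the current parameters (new policy) each minibatch.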
7. Implications and Distinctions
PSPO unifies and extends PPO-based RL for both discrete sequence generation and multi-turn agentic settings by emphasizing process-aware, sequence-level optimization. It mitigates the main deficiencies of token-level optimizers (credit assignment, instability under sparse rewards) and outcome-level optimizers (discarded process signals), enabling stable, sample-efficient, high-performance RL across diverse, high-dimensional sequence tasks.
A plausible implication is that as agentic AI tasks increase in complexity, frameworks explicitly accounting for process structure and reward granularity, such as PSPO, are likely to underlie state-of-the-art autonomous agents in information retrieval, dialogue, and tool-driven reasoning (Pan et al., 15 Jan 2026).