Proximal Sequence Policy Optimization (PSPO)

Updated 22 January 2026
  • PSPO is a reinforcement learning framework that treats whole response sequences as atomic actions, improving reward attribution and stability.
  • It aligns policy and value updates to the sequence level, mitigating noise from token-level credit assignment in complex, multi-turn tasks.
  • Empirical evaluation shows PSPO achieves higher sample efficiency and robust performance in applications like academic paper retrieval and conditional sequence generation.

Proximal Sequence Policy Optimization (PSPO) refers to a class of reinforcement learning (RL) algorithms designed for robust, stable optimization in sequence generation and multi-turn agentic tasks. PSPO enhances the standard Proximal Policy Optimization (PPO) framework by aligning the granularity of optimization with the structure of the agent-environment interaction, ensuring that credit assignment, policy updates, and reward signals respect the atomic units of complex sequential tasks. PSPO has been empirically validated in domains such as academic paper search with the PaperScout agent and in conditional sequence generation, demonstrating improved sample efficiency, stability, and final-task performance compared to both vanilla policy gradient and token-level PPO baselines (Pan et al., 15 Jan 2026, Tuan et al., 2018).

1. Motivation and Background

Multi-turn, agentic tasks, such as LLM-driven academic paper search or sophisticated dialogue systems, intrinsically decompose into discrete interaction turns, each producing complex, high-dimensional action sequences (e.g., full responses or tool calls) and yielding sparse, delayed rewards. Standard reinforcement learning strategies, including token-level PPO, are mismatched for such domains: they distribute sequence-level rewards across individual tokens, leading to noisy credit assignment and instability in value learning. Outcome-level RL methods, such as GRPO or GSPO, operate at the trajectory level, disregarding valuable intermediate process signals and exacerbating credit assignment challenges (Pan et al., 15 Jan 2026).

PSPO addresses this granularity mismatch by considering entire sequences (e.g., an agent’s full response at each turn) as atomic actions and aligning both policy and value updates to the sequence level. This process-aware optimization enables more precise reward attribution, better utilization of process signals, and improved synergy between RL optimization and the task’s structure (Pan et al., 15 Jan 2026).

2. Formal PSPO Objectives and Algorithm

PSPO is formulated within a partially observable Markov decision process (POMDP), where, at each turn $t$, the agent observes state $x_t$, outputs an action $y_t$ (a full sequence), executes any necessary tool calls, and receives a turn-level reward $r_t$. Both the policy $\pi_\theta$ and value $V_\phi$ networks are updated at the sequence level using generalized advantage estimation (GAE).

Rewards and Returns:

  • For tasks such as academic paper retrieval, the reward is defined as:

$$r_t = \sum_{p \in \text{top-}k(V_t)} \rho(p) - \eta \sum_{c \in C_t} \mathbf{1}[c \text{ was used before}]$$

where $\rho(p) \in [0, 1]$ is a relevance score and $\eta$ controls the penalty for redundant tool usage.

  • Returns and advantages are computed as:

$$R_t = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(x_{t+1}) - V_\phi(x_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}$$
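As a concrete sketch, the returns and GAE advantages above can be computed per trajectory as follows. This is a minimal NumPy illustration, not code from the paper; the function name `sequence_gae` and the convention that `values` carries one extra bootstrap entry $V_\phi(x_T)$ at the end are assumptions.

```python
import numpy as np

def sequence_gae(rewards, values, gamma=0.99, lam=0.95):
    """Discounted returns R_t and GAE advantages A_t for one trajectory,
    treating each turn's full response as a single action.
    `values` holds V_phi(x_t) for t = 0..T (one bootstrap value at the end)."""
    T = len(rewards)
    returns = np.zeros(T)
    advantages = np.zeros(T)
    ret, gae = 0.0, 0.0
    for t in reversed(range(T)):
        ret = rewards[t] + gamma * ret          # R_t = sum_l gamma^l r_{t+l}
        returns[t] = ret
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae         # A_t = sum_l (gamma*lam)^l delta_{t+l}
        advantages[t] = gae
    return returns, advantages
```

Computing both quantities in a single backward pass keeps the per-turn (rather than per-token) granularity explicit: one reward, one value, and one advantage per turn.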

Surrogate Objectives:

  • Actor loss: The sequence-level importance ratio is

$$w_t(\theta) = \frac{\pi_\theta(y_t \mid x_t)}{\pi_{\theta_\text{old}}(y_t \mid x_t)}$$

yielding the clipped surrogate objective:

$$L_\text{actor}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\left[\min\left(w_t(\theta)\hat{A}_t,\ \mathrm{clip}(w_t(\theta),\, 1-\epsilon_\text{low},\, 1+\epsilon_\text{high})\,\hat{A}_t\right)\right]$$

  • Critic loss: Using normalized returns across the batch,

$$L_\text{critic}(\phi) = \mathbb{E}_{\tau \sim \pi}\left[\left(V_\phi(x_t) - \mathrm{Norm}(R_t)\right)^2\right]$$

  • PSPO uses asymmetric clipping bounds $[\epsilon_\text{low}, \epsilon_\text{high}]$ for enhanced exploration and stability.
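A minimal sketch of the asymmetric clipped actor objective, assuming each sequence's log-probability has already been summed over its tokens so that the exponentiated difference is the sequence-level ratio $w_t$. The function name and the example $\epsilon$ values are illustrative, not taken from the paper.

```python
import numpy as np

def pspo_actor_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Sequence-level clipped surrogate (returned negated, as a loss to
    minimize). logp_* are summed token log-probs of each whole response
    y_t, so exp(logp_new - logp_old) is the sequence importance ratio w_t."""
    w = np.exp(logp_new - logp_old)                    # w_t(theta)
    clipped = np.clip(w, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(w * adv, clipped * adv)     # pessimistic bound
    return float(-surrogate.mean())
```

Note the asymmetry: the upward clip $1+\epsilon_\text{high}$ is looser than the downward clip $1-\epsilon_\text{low}$, which permits slightly larger probability increases on positively-advantaged sequences and thus encourages exploration.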

Algorithmic Steps:

  1. Collect $N$ trajectories per epoch, each with $T$ turns.
  2. At each turn, generate response $y_t$, execute tool calls, record $(x_t, y_t, r_t, V_t)$.
  3. Compute returns $R_t$, temporal differences $\delta_t$, and advantages $\hat{A}_t$.
  4. Normalize $R_t$.
  5. Optimize $L_\text{critic}$ and $L_\text{actor}$ (optionally with value pre-training, gradient clipping, and learning-rate scheduling).

This approach is process-aware, with rewards and updates aligned to sequence-level agent actions, unlike token-level PPO (Pan et al., 15 Jan 2026).
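The critic target with normalized returns can be sketched as follows. This is a hedged illustration: $\mathrm{Norm}$ is taken here as per-batch standardization, which is one simple reading of the normalization step; the paper's exact normalization scheme may differ.

```python
import numpy as np

def pspo_critic_loss(values, returns, eps=1e-8):
    """Squared error between V_phi(x_t) and batch-normalized returns
    Norm(R_t), computed over a flat batch of turns."""
    norm_ret = (returns - returns.mean()) / (returns.std() + eps)
    return float(np.mean((values - norm_ret) ** 2))
```

Normalizing the regression target keeps the critic's output scale stable across queries whose raw returns differ by orders of magnitude.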

3. PSPO in Sequence Generation: Standard and Dynamic Variants

Proximal Sequence Policy Optimization can also denote an adaptation of PPO for sequence-to-sequence (seq2seq) RL problems, as presented in prior work (Tuan et al., 2018). In standard PSPO, the policy $\pi_\theta$ selects tokens incrementally, but reward assignment and policy updates can occur at the token or sequence level.

Standard PPO for Sequences:

  • The token-level importance ratio at time $t$ is $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$.
  • Main surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_\text{old}}}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right]$$

PPO-dynamic Variant:

  • Clipping bounds are adapted based on action probability, expanding the trust region for less-probable actions to encourage exploration:

$$\beta(\cdot) = \min\!\left(\beta_1,\ \beta_2 \sqrt{1/\pi_{\theta_\text{old}}(a_t \mid s_t) - 1}\right), \qquad \alpha(\cdot) = \min\!\left(\alpha_1,\ \alpha_2 \sqrt{1/\pi_{\theta_\text{old}}(a_t \mid s_t) - 1}\right)$$

  • Dynamic surrogate objective:

$$L^{\text{PPO-dyn}}(\theta) = \mathbb{E}_{(s_t, a_t)}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \beta(\cdot),\, 1 + \alpha(\cdot))\, A_t\right)\right]$$

This approach accelerates convergence by loosening constraints where the old policy is uncertain and enforces stricter updates where it is more confident (Tuan et al., 2018).
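The probability-dependent bounds can be sketched as follows. The hyperparameter defaults ($\alpha_1 = \beta_1 = 0.3$, $\alpha_2 = \beta_2 = 1.0$) are illustrative example settings, not values prescribed by the paper.

```python
import numpy as np

def dynamic_clip_bounds(p_old, a1=0.3, b1=0.3, a2=1.0, b2=1.0):
    """Clip range (1 - beta, 1 + alpha) from the PPO-dynamic objective:
    actions that are rare under the old policy (small p_old) get a wider
    trust region; near-certain actions get almost no slack."""
    spread = np.sqrt(1.0 / p_old - 1.0)   # grows as p_old -> 0
    beta = np.minimum(b1, b2 * spread)
    alpha = np.minimum(a1, a2 * spread)
    return 1.0 - beta, 1.0 + alpha
```

For example, an action with $\pi_{\theta_\text{old}} = 0.5$ yields the full range $(0.7, 1.3)$ under these defaults, while an action with probability near 1 collapses the range to almost exactly $(1, 1)$.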

4. Theoretical Properties and Empirical Behavior

PSPO’s advantages stem from improved alignment of optimization granularity, reward attribution, and stability mechanisms:

  • Improved Credit Assignment: Reward signals are attributed to whole responses, mitigating the signal dilution that arises in token-level reward distribution.
  • Process-Awareness: Intermediate rewards at each sequence step enable richer and more informative value updates, in contrast to outcome-only optimization that disregards pivotal process steps.
  • Variance Reduction and Stability: Top-$k$ reward aggregation, running mean-variance return normalization, and asymmetric or dynamic clipping produce more stable actor gradients and diminish variance across heterogeneous query difficulties.
  • Empirical Convergence: In academic paper retrieval and synthetic sequence-generation, PSPO converges faster and to better final returns compared to both token-level PPO and outcome-level methods like GSPO (Pan et al., 15 Jan 2026, Tuan et al., 2018).
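The running mean-variance normalization mentioned above can be sketched with a Welford-style accumulator. This is an illustrative implementation under the assumption that returns are normalized against statistics accumulated over training, not code from either paper.

```python
class RunningNorm:
    """Running mean/variance tracker (Welford's algorithm) for
    normalizing returns across heterogeneous query difficulties."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        var = self.m2 / max(self.n - 1, 1)   # sample variance
        return (x - self.mean) / (var ** 0.5 + eps)
```

A running normalizer, unlike per-batch standardization, keeps the critic target consistent across epochs even as batch composition shifts.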

5. Experimental Evaluation

Comprehensive empirical studies validate PSPO and its variants on both synthetic and real-world tasks.

PaperScout Experiments (Pan et al., 15 Jan 2026):

  • Benchmarks: AutoScholarQuery (33k synthetic queries), RealScholarQuery (50 expert-curated queries).
  • Retrieval Setup: Tool-based search/expand on Milvus/ar5iv during training, Google Search for test evaluation.
  • Baselines: Google Search, Google Scholar, fixed-workflow RL (PaSa, SPAR), PPO, GSPO (on Qwen3-4B).
  • Metrics: Precision, Recall, F1; LLM-Score (0–3 by three large models).
  • Key Results:
    • PSPO-trained PaperScout achieves a recall of 0.574 and an LLM-score of 2.576, outperforming GSPO (recall 0.557, LLM-score 2.510) and PPO (recall 0.537, LLM-score 2.417).
    • PSPO displays sample efficiency, robust retrieval with limited tool calls, and stable optimization (lower actor gradient norms, lower critic loss).
  • Ablations: Reverting to token-level PPO or outcome-level GSPO degrades stability and final performance; intermediate process-aware rewards are critical.

Sequence Generation Studies (Tuan et al., 2018):

  • Synthetic Counting:
    • PPO-dynamic achieves 98.62% precision (vs. 98.42% PPO; REINFORCE slightly higher but with mode collapse).
    • PPO-dynamic diversifies outputs and accelerates convergence relative to fixed-ε PPO.
  • Chatbot (OpenSubtitles):
    • PPO-dynamic attains the highest BLEU-2 (14.73; vs. 14.12 PPO, 14.29 REINFORCE) and rapid convergence.
  • Stability: Both standard and dynamic PSPO outperform REINFORCE in variance reduction, convergence speed, and output diversity.

6. Practical Implementation Guidelines

Recommended practices for PSPO deployment in sequence generation and agentic tasks include:

  • Pretraining policies by maximum-likelihood to stabilize early-stage learning.
  • Tuning clipping parameters ($\epsilon$ for PPO, $\alpha_2 = \beta_2$ for PPO-dynamic) based on validation performance; values of $\epsilon \in [0.1, 0.2]$ and $\alpha_2 \approx 1.0$ have been validated.
  • Employing learned value baselines $b(s_t)$ to reduce advantage-estimation variance.
  • Monitoring empirical KL-divergence between policy iterations to prevent premature policy drift.
  • In adversarial RL (e.g., SeqGAN), replacing vanilla policy gradient with PSPO enhances GAN training stability (Tuan et al., 2018).
  • For multi-turn, process-driven domains, aligning reward granularity and value estimation with the actual agentic turns is essential for both convergence and final-task performance (Pan et al., 15 Jan 2026).
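For the KL-monitoring recommendation above, a simple sampled estimator suffices in practice. This sketch assumes access to sequence log-probabilities under both policies for samples drawn from the old policy; the estimator choice and function name are illustrative.

```python
import numpy as np

def approx_kl(logp_old, logp_new):
    """Monte Carlo estimate of KL(pi_old || pi_new) from log-probs of
    sequences sampled under the old policy: E_old[log p_old - log p_new]."""
    return float(np.mean(logp_old - logp_new))
```

A sustained rise in this estimate across updates signals policy drift; a common response is to shrink the clip range or stop the current epoch's inner optimization early.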

7. Implications and Distinctions

PSPO unifies and extends PPO-based RL for both discrete sequence generation and multi-turn agentic settings by emphasizing process-aware, sequence-level optimization. It mitigates the main deficiencies of token-level optimizers (credit assignment, instability under sparse rewards) and outcome-level optimizers (discarded process signals), enabling stable, sample-efficient, high-performance RL across diverse, high-dimensional sequence tasks.

A plausible implication is that as agentic AI tasks increase in complexity, frameworks explicitly accounting for process structure and reward granularity, such as PSPO, are likely to underlie state-of-the-art autonomous agents in information retrieval, dialogue, and tool-driven reasoning (Pan et al., 15 Jan 2026).
