Group Sub-sequence Policy Optimization (GSsPO)

Updated 8 February 2026
  • GSsPO is a reinforcement learning method that optimizes multi-turn agentic workflows by aligning gradient updates with semantically coherent Think-Action cycles.
  • It recalibrates optimization granularity to sub-sequences, preserving logical structure and enhancing credit assignment compared to token- or full-sequence methods.
  • Empirical results show GSsPO achieves smoother convergence and superior performance across QA benchmarks, making it effective for complex decision-making tasks.

Group Sub-sequence Policy Optimization (GSsPO) is a structure-aware reinforcement learning (RL) algorithm designed to optimize policies for multi-turn agentic workflows by aligning gradient updates with semantically meaningful decision units—namely, sub-sequences corresponding to atomic Think-Action cycles. In contrast to conventional token-level or full-sequence reinforcement learning, GSsPO recalibrates optimization granularity to sub-sequences, preserving the logical and causal structure of agentic interaction and improving credit assignment in complex reasoning tasks (Kong et al., 1 Feb 2026).

1. Motivation and Background

LLM-based agentic workflows have driven significant advances in multi-step reasoning and tool-augmented tasks. Traditional workflow synthesis paradigms typically adopt a one-shot, open-loop, code-centric approach, treating the decision process as monolithic program generation followed by execution. This paradigm, the "Static Execution Trap," precludes conditioning on intermediate observations and ties optimization to either token-level (e.g., GRPO) or entire-sequence-level (e.g., GSPO) units. Token-level RL, by updating each output token independently, fragments the Think-Action semantic structure and disrupts inter-step dependencies. Sequence-level RL, while maintaining holistic semantic integrity, conflates credit assignment across many steps, often obscuring which segments contribute to successful outcomes. GSsPO addresses this granularity mismatch by targeting sub-sequences that map directly to reasoning-action cycles, thereby preserving semantic coherence and enabling robust credit propagation (Kong et al., 1 Feb 2026).

2. Formalization and Core Concepts

Let $x \in \mathcal{D}$ denote the input prompt. At each turn $k$, the agent maintains a state $s_k$ capturing the history of prior Think-Action segments and tool outputs. The agent emits a "think" segment $y^{(k)}_{\text{think}}$ (rationale) and then an "action" segment $y^{(k)}_{\text{act}}$ (tool call or answer), forming an atomic sub-sequence. The policy $\pi_\theta$ is factorized autoregressively:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid y_{<t}, x).$$

A sampled response $y_i$ decomposes into a set of sub-sequences $\mathcal{S}_i = \{s : s \text{ is one Think-Action turn in } y_i\}$. A group of $G$ such trajectories is sampled as $\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$. Sub-sequences are token-contiguous, and each has length $|s|$.
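The factorization and the sub-sequence partition can be made concrete with a minimal sketch. All numbers and span boundaries below are hypothetical; in practice, per-token log-probabilities come from the policy model's forward pass and the spans from tag parsing:

```python
import math

# Hypothetical per-token log-probs log pi_theta(y_t | y_<t, x) for one
# sampled response y_i (would come from the model's forward pass).
token_logps = [-0.5, -1.2, -0.3, -0.9, -0.4, -0.7]

# Autoregressive factorization: log pi_theta(y | x) is the sum of the
# per-token conditional log-probs.
traj_logp = sum(token_logps)

# Token-contiguous sub-sequence spans, one per Think-Action turn.
# The spans partition the response: every token belongs to exactly one s.
spans = [(0, 3), (3, 6)]  # two hypothetical turns of three tokens each

subseq_logps = [sum(token_logps[a:b]) for a, b in spans]
assert math.isclose(sum(subseq_logps), traj_logp)  # partition sanity check
```

Because the spans are contiguous and disjoint, any per-token quantity (log-prob, log-ratio, gradient) aggregates cleanly to the sub-sequence level, which is what the GSsPO objective exploits.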

3. Algorithmic Formulation

GSsPO's core objective applies group-averaged, sub-sequence-normalized, clipped policy gradients in the spirit of PPO, but at the sub-sequence level:

$$\mathcal{J}_{\text{GSsPO}}(\theta) = \mathbb{E}_{x,\{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{S}_i|} \sum_{s \in \mathcal{S}_i} \min \left( r_s(\theta)\, \widehat{A}_i,\; \mathrm{clip}\!\left(r_s(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i \right) \right]$$

where

$$r_s(\theta) = \left( \prod_{t \in s} \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} \right)^{1/|s|}$$

is the geometric-mean importance ratio for the sub-sequence, and $\widehat{A}_i$ is the standardized advantage for trajectory $i$, normalized within its group. The per-token policy gradient is averaged over tokens within $s$, then over sub-sequences and group members:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{S}_i|} \sum_{s \in \mathcal{S}_i} \widehat{A}_i\, r_s(\theta)\, \nabla_\theta \log r_s(\theta) \right]$$

with

$$\nabla_\theta \log r_s(\theta) = \frac{1}{|s|} \sum_{t \in s} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x).$$

This averaging ensures sub-sequences are not penalized for verbosity and credit is assigned at the correct semantic granularity (Kong et al., 1 Feb 2026).
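The objective above can be sketched numerically. A minimal Python illustration, assuming per-token log-ratios ($\log \pi_\theta - \log \pi_{\theta_{\text{old}}}$) are precomputed; the sample values are hypothetical:

```python
import math

def gsspo_surrogate(group, eps=0.2):
    """Clipped GSsPO surrogate for one group of trajectories.

    `group` is a list of trajectories; each trajectory is a dict with:
    - 'subseqs': one list of per-token log-ratios per Think-Action turn
    - 'adv': the group-standardized trajectory advantage A_hat_i
    """
    total = 0.0
    for traj in group:
        inner = 0.0
        for s in traj["subseqs"]:
            # Geometric-mean importance ratio r_s(theta):
            # exponentiated mean of the per-token log-ratios.
            r_s = math.exp(sum(s) / len(s))
            clipped = min(max(r_s, 1.0 - eps), 1.0 + eps)
            # PPO-style pessimistic min of unclipped and clipped terms.
            inner += min(r_s * traj["adv"], clipped * traj["adv"])
        # Normalize by the number of sub-sequences |S_i|.
        total += inner / len(traj["subseqs"])
    return total / len(group)  # average over the group of size G

# Hypothetical group: two trajectories, opposite-sign advantages.
group = [
    {"subseqs": [[0.0, 0.0], [0.1, 0.1]], "adv": 1.0},
    {"subseqs": [[-0.1, -0.1]], "adv": -1.0},
]
objective = gsspo_surrogate(group)
```

The geometric mean in log space is just an arithmetic mean of per-token log-ratios, which is why longer sub-sequences are not penalized for verbosity: the ratio's scale is independent of $|s|$.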

4. Algorithmic Workflow and Implementation

The standard GSsPO loop proceeds as follows:

  1. Sample a batch of prompts $\{x_b\}_{b=1}^{B}$.
  2. For each $x_b$, sample $G$ trajectories under $\pi_{\theta_{\text{old}}}$.
  3. Parse sub-sequences $\mathcal{S}_i$ and compute per-trajectory rewards $r(x_b, y_i)$, where $r(x, y) = R_{\text{format}}(y) + R_{\text{outcome}}(y)$.
  4. Group-wise standardization yields $\widehat{A}_i$.
  5. For each sub-sequence, compute $r_s(\theta)$, the surrogate loss $L_s$, and the averaged gradient $\nabla_\theta \log r_s(\theta)$.
  6. Accumulate gradients, perform an optimizer step, and update $\theta_{\text{old}} \leftarrow \theta$.
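Step 4 (group-wise standardization) admits a compact sketch. A minimal illustration, assuming scalar per-trajectory rewards and the usual small epsilon to guard against a zero-variance group:

```python
def standardize(rewards, eps=1e-8):
    """Group-wise standardization: A_hat_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of G=4 trajectory rewards (e.g. format + outcome).
advs = standardize([1.0, 0.0, 1.0, 0.0])
```

Because the advantage is computed per trajectory and shared by all of that trajectory's sub-sequences, the group baseline removes prompt-level reward scale while the sub-sequence ratio $r_s(\theta)$ localizes the update.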

Parsing sub-sequences is operationalized via unambiguous tags demarcating Think and Action segments (e.g., `<think>`...`</think>` and `<tool>`...`</tool>` markers). A separate reference policy imposes a KL penalty for regularization (Kong et al., 1 Feb 2026).
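Tag-based parsing can be sketched with a simple regex. A minimal illustration, assuming `<think>`/`<tool>` tag names as described above; the exact tag vocabulary and the sample response are hypothetical:

```python
import re

# One Think-Action turn: a <think>...</think> block followed by a
# <tool>...</tool> block (tool call or final answer). Non-greedy
# matching keeps each turn atomic; DOTALL lets rationales span lines.
TURN = re.compile(r"<think>(.*?)</think>\s*<tool>(.*?)</tool>", re.DOTALL)

response = (
    "<think>Need the capital of France.</think>"
    "<tool>search('capital of France')</tool>"
    "<think>Result says Paris; answer.</think>"
    "<tool>answer('Paris')</tool>"
)

# Each match is one atomic sub-sequence s in S_i.
subseqs = [{"think": t.strip(), "act": a.strip()}
           for t, a in TURN.findall(response)]
```

In a real pipeline the matched character spans would be mapped back to token indices so that each sub-sequence is token-contiguous, as the formalization requires.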

5. Theoretical Significance

By aligning updates with atomic reasoning units, GSsPO provides finer-grained credit assignment than sequence-level RL, while avoiding the instability of token-level methods. The structure-aware surrogate maintains PPO-style monotonic improvement guarantees under bounded importance ratios and adequate exploration. Empirically, GSsPO demonstrates faster and more stable convergence compared with both token-level (e.g., GRPO) and sequence-level (e.g., GSPO) baselines (as shown in reward and convergence plots), and ablation studies validate the superiority of the sub-sequence granularity. The algorithm is generalizable to any multi-turn agentic process that can be factored into semantically coherent, atomic cycles—such as conversational agents, plan–execute–observe planning systems, or interactive theorem proving (Kong et al., 1 Feb 2026).

6. Empirical Validation and Applications

Workflow-R1 with GSsPO has been validated on seven QA benchmarks: NaturalQuestions, TriviaQA, PopQA (general QA), and HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle (multi-hop QA). Against direct inference, chain-of-thought, self-consistency, MedPrompt, agentic workflows MaAS/AFlow, and search-augmented RLHF baselines, Workflow-R1+GSsPO achieves superior performance (average exact-match gain ∼0.03–0.05). Workflow-R1-Search further raises performance by incorporating search-based operators, demonstrating the extensibility of GSsPO-based optimization. Ablations confirm GSsPO > GSPO > GRPO across standard and search-augmented settings. GSsPO also yields smoother convergence and mitigates reward plateauing. A plausible implication is that atomic sub-sequence optimization is well suited to agentic workflows demarcated by natural reasoning-action boundaries (Kong et al., 1 Feb 2026).

GSsPO is situated between token-level (GRPO) and group-level sequence RL (GSPO). Agent-GSPO (Fan et al., 26 Oct 2025) leverages the GSPO framework for communication-efficient multi-agent systems but operates at the level of entire sequences, trading token-level operations for memory efficiency and enabling optimization for token economy via communication-aware rewards. In contrast, GSsPO (as implemented in Workflow-R1) uses sub-sequence–level clipping and normalization to align with the Think-Action cycle, yielding improved credit propagation and semantic alignment in agentic decision-making. Both GSsPO and GSPO employ group-wise advantage normalization, clipped surrogate objectives, and length-normalized importance ratios, but differ fundamentally in the atomic unit of optimization. GSsPO is thus distinct in its semantic granularity and is empirically validated as a more effective structure-aware RL solution in multi-turn workflow construction (Kong et al., 1 Feb 2026, Fan et al., 26 Oct 2025).


In summary, Group Sub-sequence Policy Optimization provides a principled, empirically validated approach for aligning RL optimization with the atomic semantics of agentic workflows, overcoming the granularity and credit assignment challenges innate to token- and sequence-level RL. Its formalization enables robust, sample-efficient learning and generalization to a broad spectrum of multi-turn decision-making tasks (Kong et al., 1 Feb 2026).
