Group Sub-sequence Policy Optimization (GSsPO)

Updated 8 February 2026
  • GSsPO is a reinforcement learning method that optimizes multi-turn agentic workflows by aligning gradient updates with semantically coherent Think-Action cycles.
  • It recalibrates optimization granularity to sub-sequences, preserving logical structure and enhancing credit assignment compared to token- or full-sequence methods.
  • Empirical results show GSsPO achieves smoother convergence and superior performance across QA benchmarks, making it effective for complex decision-making tasks.

Group Sub-sequence Policy Optimization (GSsPO) is a structure-aware reinforcement learning (RL) algorithm designed to optimize policies for multi-turn agentic workflows by aligning gradient updates with semantically meaningful decision units—namely, sub-sequences corresponding to atomic Think-Action cycles. In contrast to conventional token-level or full-sequence reinforcement learning, GSsPO recalibrates optimization granularity to sub-sequences, preserving the logical and causal structure of agentic interaction and improving credit assignment in complex reasoning tasks (Kong et al., 1 Feb 2026).

1. Motivation and Background

LLM-based agentic workflows have driven significant advances in multi-step reasoning and tool-augmented tasks. Traditional workflow synthesis paradigms typically adopt a one-shot, open-loop, code-centric approach, treating the decision process as monolithic program generation followed by execution. This paradigm, the "Static Execution Trap," precludes conditioning on intermediate observations and ties optimization to either token-level (e.g., GRPO) or entire-sequence-level (e.g., GSPO) units. Token-level RL, by updating each output token independently, fragments the Think-Action semantic structure and disrupts inter-step dependencies. Sequence-level RL, while maintaining holistic semantic integrity, conflates credit assignment across many steps, often obscuring which segments contribute to successful outcomes. GSsPO addresses this granularity mismatch by targeting sub-sequences that map directly to reasoning-action cycles, thereby preserving semantic coherence and enabling robust credit propagation (Kong et al., 1 Feb 2026).

2. Formalization and Core Concepts

Let $x \in \mathcal{D}$ denote the input prompt. At each turn $k$, the agent maintains a state $s_k$ capturing the history of prior Think-Action segments and tool outputs. The agent emits a "think" segment $y^{(k)}_{\text{think}}$ (rationale) and then an "action" segment $y^{(k)}_{\text{act}}$ (tool call or answer), forming an atomic sub-sequence. The policy $\pi_\theta$ is factorized autoregressively:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid y_{<t}, x).$$

A sampled response $y_i$ decomposes into a set of sub-sequences $\mathcal{S}_i = \{s : s \text{ is one Think-Action turn in } y_i\}$. A group of $G$ such trajectories is sampled as $\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$. Sub-sequences are token-contiguous, and each has length $|s|$.
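The factorization and the sub-sequence partition can be made concrete with a minimal sketch. All numbers and span boundaries below are hypothetical; in practice, per-token log-probabilities come from the policy model's forward pass and the spans from tag parsing:

```python
import math

# Hypothetical per-token log-probs log pi_theta(y_t | y_<t, x) for one
# sampled response y_i (would come from the model's forward pass).
token_logps = [-0.5, -1.2, -0.3, -0.9, -0.4, -0.7]

# Autoregressive factorization: log pi_theta(y | x) is the sum of the
# per-token conditional log-probs.
traj_logp = sum(token_logps)

# Token-contiguous sub-sequence spans, one per Think-Action turn.
# The spans partition the response: every token belongs to exactly one s.
spans = [(0, 3), (3, 6)]  # two hypothetical turns of three tokens each

subseq_logps = [sum(token_logps[a:b]) for a, b in spans]
assert math.isclose(sum(subseq_logps), traj_logp)  # partition sanity check
```

Because the spans are contiguous and disjoint, any per-token quantity (log-prob, log-ratio, gradient) aggregates cleanly to the sub-sequence level, which is what the GSsPO objective exploits.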

3. Algorithmic Formulation

GSsPO's core objective applies group-averaged, sub-sequence-normalized, clipped policy gradients in the spirit of PPO, but at the sub-sequence level:

$$\mathcal{J}_{\text{GSsPO}}(\theta) = \mathbb{E}_{x,\{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{S}_i|} \sum_{s \in \mathcal{S}_i} \min \left( r_s(\theta)\, \widehat{A}_i,\; \mathrm{clip}\!\left(r_s(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i \right) \right]$$

where

$$r_s(\theta) = \left( \prod_{t \in s} \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} \right)^{1/|s|}$$

is the geometric-mean importance ratio for the sub-sequence, and $\widehat{A}_i$ is the standardized advantage for trajectory $i$, normalized within its group. The per-token policy gradient is averaged over tokens within $s$, then over sub-sequences and group members:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{S}_i|} \sum_{s \in \mathcal{S}_i} \widehat{A}_i\, r_s(\theta)\, \nabla_\theta \log r_s(\theta) \right]$$

with

$$\nabla_\theta \log r_s(\theta) = \frac{1}{|s|} \sum_{t \in s} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x).$$

This averaging ensures sub-sequences are not penalized for verbosity and credit is assigned at the correct semantic granularity (Kong et al., 1 Feb 2026).
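The objective above can be sketched numerically. A minimal Python illustration, assuming per-token log-ratios ($\log \pi_\theta - \log \pi_{\theta_{\text{old}}}$) are precomputed; the sample values are hypothetical:

```python
import math

def gsspo_surrogate(group, eps=0.2):
    """Clipped GSsPO surrogate for one group of trajectories.

    `group` is a list of trajectories; each trajectory is a dict with:
    - 'subseqs': one list of per-token log-ratios per Think-Action turn
    - 'adv': the group-standardized trajectory advantage A_hat_i
    """
    total = 0.0
    for traj in group:
        inner = 0.0
        for s in traj["subseqs"]:
            # Geometric-mean importance ratio r_s(theta):
            # exponentiated mean of the per-token log-ratios.
            r_s = math.exp(sum(s) / len(s))
            clipped = min(max(r_s, 1.0 - eps), 1.0 + eps)
            # PPO-style pessimistic min of unclipped and clipped terms.
            inner += min(r_s * traj["adv"], clipped * traj["adv"])
        # Normalize by the number of sub-sequences |S_i|.
        total += inner / len(traj["subseqs"])
    return total / len(group)  # average over the group of size G

# Hypothetical group: two trajectories, opposite-sign advantages.
group = [
    {"subseqs": [[0.0, 0.0], [0.1, 0.1]], "adv": 1.0},
    {"subseqs": [[-0.1, -0.1]], "adv": -1.0},
]
objective = gsspo_surrogate(group)
```

The geometric mean in log space is just an arithmetic mean of per-token log-ratios, which is why longer sub-sequences are not penalized for verbosity: the ratio's scale is independent of $|s|$.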

4. Algorithmic Workflow and Implementation

The standard GSsPO loop proceeds as follows:

  1. Sample a batch of prompts $\{x_b\}_{b=1}^{B}$.
  2. For each $x_b$, sample $G$ trajectories under $\pi_{\theta_{\text{old}}}$.
  3. Parse sub-sequences $\mathcal{S}_i$ and compute per-trajectory rewards $r(x_b, y_i)$, where $r(x, y) = R_{\text{format}}(y) + R_{\text{outcome}}(y)$.
  4. Group-wise standardization yields $\widehat{A}_i$.
  5. For each sub-sequence, compute $r_s(\theta)$, the surrogate loss $L_s$, and the averaged gradient $\nabla_\theta \log r_s(\theta)$.
  6. Accumulate gradients, perform an optimizer step, and update $\theta_{\text{old}} \leftarrow \theta$.
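Step 4 (group-wise standardization) admits a compact sketch. A minimal illustration, assuming scalar per-trajectory rewards and the usual small epsilon to guard against a zero-variance group:

```python
def standardize(rewards, eps=1e-8):
    """Group-wise standardization: A_hat_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of G=4 trajectory rewards (e.g. format + outcome).
advs = standardize([1.0, 0.0, 1.0, 0.0])
```

Because the advantage is computed per trajectory and shared by all of that trajectory's sub-sequences, the group baseline removes prompt-level reward scale while the sub-sequence ratio $r_s(\theta)$ localizes the update.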

Parsing sub-sequences is operationalized via unambiguous tags demarcating Think and Action segments (e.g., `<think>`...`</think>` and `<tool>`...`</tool>` markers). A separate reference policy imposes a KL penalty for regularization (Kong et al., 1 Feb 2026).
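Tag-based parsing can be sketched with a simple regex. A minimal illustration, assuming `<think>`/`<tool>` tag names as described above; the exact tag vocabulary and the sample response are hypothetical:

```python
import re

# One Think-Action turn: a <think>...</think> block followed by a
# <tool>...</tool> block (tool call or final answer). Non-greedy
# matching keeps each turn atomic; DOTALL lets rationales span lines.
TURN = re.compile(r"<think>(.*?)</think>\s*<tool>(.*?)</tool>", re.DOTALL)

response = (
    "<think>Need the capital of France.</think>"
    "<tool>search('capital of France')</tool>"
    "<think>Result says Paris; answer.</think>"
    "<tool>answer('Paris')</tool>"
)

# Each match is one atomic sub-sequence s in S_i.
subseqs = [{"think": t.strip(), "act": a.strip()}
           for t, a in TURN.findall(response)]
```

In a real pipeline the matched character spans would be mapped back to token indices so that each sub-sequence is token-contiguous, as the formalization requires.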

5. Theoretical Significance

By aligning updates with atomic reasoning units, GSsPO provides finer-grained credit assignment than sequence-level RL, while avoiding the instability of token-level methods. The structure-aware surrogate maintains PPO-style monotonic improvement guarantees under bounded importance ratios and adequate exploration. Empirically, GSsPO demonstrates faster and more stable convergence compared with both token-level (e.g., GRPO) and sequence-level (e.g., GSPO) baselines (as shown in reward and convergence plots), and ablation studies validate the superiority of the sub-sequence granularity. The algorithm is generalizable to any multi-turn agentic process that can be factored into semantically coherent, atomic cycles—such as conversational agents, plan–execute–observe planning systems, or interactive theorem proving (Kong et al., 1 Feb 2026).

6. Empirical Validation and Applications

Workflow-R1 with GSsPO has been validated on seven QA benchmarks: NaturalQuestions, TriviaQA, PopQA (general QA), and HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle (multi-hop QA). Against direct inference, chain-of-thought, self-consistency, MedPrompt, agentic workflows MaAS/AFlow, and search-augmented RLHF baselines, Workflow-R1+GSsPO achieves superior performance (average exact-match gain ∼0.03–0.05). Workflow-R1-Search further raises performance by incorporating search-based operators, demonstrating the extensibility of GSsPO-based optimization. Ablations confirm GSsPO > GSPO > GRPO across standard and search-augmented settings. GSsPO also yields smoother convergence and mitigates reward plateauing. A plausible implication is that atomic sub-sequence optimization is well suited to agentic workflows demarcated by natural reasoning-action boundaries (Kong et al., 1 Feb 2026).

GSsPO is situated between token-level (GRPO) and group-level sequence RL (GSPO). Agent-GSPO (Fan et al., 26 Oct 2025) leverages the GSPO framework for communication-efficient multi-agent systems but operates at the level of entire sequences, trading token-level operations for memory efficiency and enabling optimization for token economy via communication-aware rewards. In contrast, GSsPO (as implemented in Workflow-R1) uses sub-sequence–level clipping and normalization to align with the Think-Action cycle, yielding improved credit propagation and semantic alignment in agentic decision-making. Both GSsPO and GSPO employ group-wise advantage normalization, clipped surrogate objectives, and length-normalized importance ratios, but differ fundamentally in the atomic unit of optimization. GSsPO is thus distinct in its semantic granularity and is empirically validated as a more effective structure-aware RL solution in multi-turn workflow construction (Kong et al., 1 Feb 2026, Fan et al., 26 Oct 2025).


In summary, Group Sub-sequence Policy Optimization provides a principled, empirically validated approach for aligning RL optimization with the atomic semantics of agentic workflows, overcoming the granularity and credit assignment challenges innate to token- and sequence-level RL. Its formalization enables robust, sample-efficient learning and generalization to a broad spectrum of multi-turn decision-making tasks (Kong et al., 1 Feb 2026).
