Owen-Shapley Policy Optimization
- OSPO is a reinforcement learning framework for LLMs that assigns token-level credit to impactful text segments, addressing sparse reward challenges.
- It integrates potential-based reward shaping with Shapley-Owen attributions to precisely redistribute rewards and eliminate length bias.
- OSPO demonstrates enhanced sample efficiency and robustness, achieving up to 25% NDCG improvement over GRPO on benchmark generative search tasks.
Owen-Shapley Policy Optimization (OSPO) is a reinforcement learning (RL) framework designed for LLMs operating in generative search and personalized recommendation settings. OSPO addresses the token-level credit assignment gap inherent in sequence-level reward regimes by leveraging principled reward redistribution techniques rooted in Shapley-Owen attributions over semantically coherent text segments. This approach enables direct credit assignment to causally impactful segments of generated language, improving sample efficiency and robustness, particularly in domains with latent user intent and sparse reward signals (Nath et al., 13 Jan 2026).
1. Formal Foundation and Sequence-Level Credit Assignment
OSPO frames single-turn LLM response generation as an episodic Markov Decision Process (MDP):
- States: Partial sequences $s_t = (x, y_{<t})$, where $x$ is the input context and $y_{<t}$ the previously generated tokens.
- Actions: Selection of the next token $a_t = y_t$; transitions are deterministic: $s_{t+1} = (x, y_{\le t})$.
- Trajectories: Full generations $y = (y_1, \dots, y_T)$ sampled under policy $\pi_\theta(\cdot \mid x)$.
- Rewards: Sparse and terminal ($r_t = 0$ for $t < T$, $r_T = R(x, y)$) with discount factor $\gamma = 1$.
The RL objective is maximizing expected terminal reward:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ R(x, y) \right]$$

The gradient estimator utilizes a sequence-level advantage:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ A(x, y) \, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

with $A(x, y) = R(x, y) - b(x)$ for baseline $b(x)$.
Standard methods such as GRPO employ group-relative normalization at the sequence level, yielding $A_i = (R_i - \bar{R}) / \sigma_R$, where $\bar{R}$ is the group mean and $\sigma_R$ the group standard deviation. This creates a credit assignment gap: the RL signal does not resolve which subsequences specifically drive reward.
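As a concrete illustration, group-relative normalization can be sketched in a few lines (the function name and the variance-stabilizing `eps` are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize terminal rewards within a group of completions
    sampled for the same prompt; every token in completion i then
    shares the same scalar advantage A_i."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because every token in a completion inherits the same $A_i$, the signal cannot indicate which subsequence earned the reward — precisely the gap OSPO closes.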
2. Potential-Based Reward Shaping and Token-Level Redistribution
OSPO adopts potential-based reward shaping [Ng et al., 1999]. For any shaping function $\Phi$ over states:

$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma \, \Phi(s_{t+1}) - \Phi(s_t)$$

Optimal policies are invariant to $\Phi$ due to telescoping over episodes. In OSPO, segment- and token-level attributions act as reward-shaping potentials over partial sequences, assigning token-level credit without distorting policy optimality.
Redistribution preserves length invariance:

$$\hat{A}_{i,t} = T_i \, w_{i,t} \, A_i$$

with normalized weights $w_{i,t}$ summing to one, ensuring $\frac{1}{T_i} \sum_{t=1}^{T_i} \hat{A}_{i,t} = A_i$ (Lemma 1). This eliminates length bias in token-level RL signals.
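A minimal numeric check of this redistribution (the function name is illustrative): scaling normalized weights by the token count keeps the per-token mean equal to the sequence-level advantage regardless of generation length.

```python
import numpy as np

def redistribute(A_seq, weights):
    """Spread a sequence-level advantage over T tokens using attribution
    weights; after normalization, scaling by T preserves the per-token mean."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalize so weights sum to one
    return len(w) * w * A_seq    # hat A_t = T * w_t * A_seq

adv_short = redistribute(0.8, [0.1, 0.6, 0.3])
adv_long = redistribute(0.8, [0.1, 0.6, 0.3, 0.0, 0.0, 0.0])
# both average to 0.8: longer generations earn no extra per-token credit
```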
3. Shapley-Owen Attribution for Segment Credit Assignment
OSPO decomposes model outputs into $m$ semantically coherent segments $g_1, \dots, g_m$ (e.g., phrases or sentences) and forms coalitions for reward evaluation:
- Characteristic function: $v(S) = R\big(x, \mathrm{concat}(\{g_j : j \in S\})\big)$ for any coalition $S \subseteq N = \{1, \dots, m\}$.
- Classical Shapley value for segment $j$:

$$\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (m - |S| - 1)!}{m!} \big[ v(S \cup \{j\}) - v(S) \big]$$
- Owen value (contiguous subset restriction): marginal contributions are averaged only over contiguous coalitions $\mathcal{C}_j$ adjacent to segment $j$:

$$\phi_j^{\mathrm{Owen}} = \frac{1}{|\mathcal{C}_j|} \sum_{S \in \mathcal{C}_j} \big[ v(S \cup \{j\}) - v(S) \big]$$
The contiguous-span restriction lowers computational cost to $O(mW)$ reward evaluations, where $W$ is the maximum coalition width sampled per rollout, while preserving fairness axioms: efficiency ($\sum_{j=1}^{m} \phi_j = v(N)$), symmetry, and linearity.

Segment-level attributions $\phi = (\phi_1, \dots, \phi_m)$ are projected onto token-level attributions via a binary span-projection matrix $P \in \{0,1\}^{m \times T}$: $a = P^{\top} \phi$.
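A toy sketch of the segment attribution and span projection, under the assumption that the contiguous coalitions for segment $j$ are left- and right-adjacent spans of bounded width; the coalition enumeration here is illustrative, not the paper's exact sampling scheme.

```python
import numpy as np

def owen_contiguous(v, m, W):
    """Average marginal contributions of segment j over contiguous
    coalitions of width < W adjacent to j (so S u {j} stays contiguous).
    v maps a frozenset of segment indices to a scalar reward."""
    phi = np.zeros(m)
    for j in range(m):
        margins = [v(frozenset({j})) - v(frozenset())]  # empty coalition
        for w in range(1, W):
            for lo, hi in ((j - w, j), (j + 1, j + 1 + w)):  # left / right span
                if lo < 0 or hi > m:
                    continue
                S = frozenset(range(lo, hi))
                margins.append(v(S | {j}) - v(S))
        phi[j] = float(np.mean(margins))
    return phi

def project_to_tokens(phi, spans, T):
    """Binary span projection: token t inherits the attribution of the
    segment whose half-open span [a, b) contains it."""
    P = np.zeros((len(phi), T))
    for j, (a, b) in enumerate(spans):
        P[j, a:b] = 1.0
    return P.T @ phi

# Additive toy reward: each segment contributes a fixed amount, so the
# attribution recovers those amounts exactly.
wts = [1.0, 2.0, 3.0]
phi = owen_contiguous(lambda S: sum(wts[i] for i in S), m=3, W=2)
tokens = project_to_tokens(phi, spans=[(0, 2), (2, 3), (3, 5)], T=5)
```

For the additive reward above, every marginal contribution of segment $j$ equals its own weight, so the attributions are exact; non-additive rewards are where the coalition averaging matters.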
4. OSPO Optimization Objective and Workflow
For each prompt, $G$ completions are sampled. Sequence-level advantages are group-normalized:

$$\hat{A}_i = \frac{R_i - \bar{R}}{\sigma_R}$$

Token-level attribution weights $w_{i,t} = a_{i,t} / \sum_{t'} a_{i,t'}$ are obtained by normalizing the projected attributions, and the per-sequence advantage is redistributed:

$$\hat{A}_{i,t} = T_i \, w_{i,t} \, \hat{A}_i$$

The PPO-style surrogate objective accumulates clipped advantages:

$$\mathcal{L}(\theta) = \mathbb{E}\!\left[ \frac{1}{T_i} \sum_{t=1}^{T_i} \min\!\big( \rho_{i,t} \hat{A}_{i,t},\; \mathrm{clip}(\rho_{i,t},\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_{i,t} \big) \right]$$

with importance ratio $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$.
Pseudocode proceeds as follows: for each context $x$, sample $G$ generations, compute sequence rewards, segment the outputs, evaluate contiguous coalitions up to the per-rollout budget, calculate $\phi^{\mathrm{Owen}}$, project and normalize to tokens, redistribute the group-relative advantage, compute PPO-style ratios, and apply policy-gradient updates.
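The final update step can be sketched as a clipped surrogate over the redistributed token advantages (a minimal NumPy illustration; in practice these would be autograd tensors, and the function name is ours):

```python
import numpy as np

def ospo_surrogate(logp_new, logp_old, token_adv, eps=0.2):
    """PPO-style clipped surrogate averaged over a completion's tokens,
    using token-level OSPO advantages instead of one sequence scalar."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # importance ratios
    adv = np.asarray(token_adv, dtype=float)
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(rho * adv, clipped)))

# On-policy sanity check: identical log-probs give rho = 1, so the
# surrogate reduces to the mean token advantage.
loss = ospo_surrogate([-1.2, -0.7], [-1.2, -0.7], [0.5, -0.1])
```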
5. Theoretical Guarantees
- Policy Optimality: Potential-based shaping guarantees OSPO’s redistribution does not affect the set of optimal policies.
- Length Invariance: Lemma 1 states that redistributed token advantages average to the original sequence-level advantage, eliminating token count bias.
- Convergence: Under standard PPO assumptions (bounded ratios, clipped objective), OSPO inherits monotonic improvement and convergence guarantees as established by PPO frameworks.
6. Empirical Evaluation and Benchmark Results
Experiments on generative search and summarization tasks were conducted over:
- Amazon ESCI (shopping queries): Query expansion for dense retrieval (retrievers: FAISS + all-mpnet-base-v2).
- H&M Fashion:
- Contextualized product search (expert LLM-synthesized queries, retrieval via SIMCSE-Large).
- User profile summarization: Chain-of-Thought response, candidate ranking (reward: Bradley-Terry head + format reward).
Baselines include SFT (supervised finetuning), DPO (direct preference optimization), and GRPO (group-relative PPO).
Metrics:
| Task | Metric | OSPO-Prop (7B) | GRPO | Large Model (Qwen-2.5-72B) |
|---|---|---|---|---|
| ESCI Search | NDCG | 0.522 | 0.418 | 0.543 |
| H&M Fashion Search | NDCG | 0.436 | 0.379 | — |
| Summarization | Win-rate (pairwise, LLM judge) | 49–54% | — | — |
OSPO demonstrates a +25% relative improvement in NDCG over GRPO on ESCI, approaches metric parity with the much larger Qwen-2.5-72B model, and displays enhanced win-rates in summarization. Under out-of-distribution (OOD) retriever shifts (swapping between the all-mpnet and SIMCSE retrievers), OSPO-Prop retained superior relative performance compared to GRPO and offline baselines.
7. Practical Considerations and Insights
OSPO’s fine-grained token-level attributions via Owen-Shapley values yield several key benefits:
- Sample Efficiency & Interpretability: Direct assignment of credit to causally impactful segments in sparse-reward settings enhances learning and model transparency.
- Robustness & Anti-hacking: Concentrating gradient updates on utility-driving text segments reduces susceptibility to reward hacking and spurious pattern overfitting.
- Coalition Design: Performance is sensitive to the coalition width $W$ and the per-rollout coalition budget; moderate contiguous widths optimize the trade-off between variance reduction and context preservation.
- Computational Overhead: OSPO incurs $O(mW)$ extra reward evaluations per rollout (one per sampled coalition), which remains practical with fast retrieval-based rewards but more resource-intensive with slower reward models. Hyperparameter tuning is essential.
- Black-box Feedback Suitability: OSPO does not require a learned value network, integrating seamlessly with PPO-style clipping and group normalization, which is advantageous in environments where reward models or retrievers are closed-source.
A plausible implication is that OSPO’s segment- and token-level credit assignment paradigm may extend to broader reasoning tasks in language modeling where sequence-level rewards and latent user intent dominate.
In summary, Owen-Shapley Policy Optimization leverages potential-based reward shaping and semantically coherent segment attributions to deliver principled, interpretable, and sample-efficient RL for generative LLMs, bridging the granularity gap in sequence-level supervisory regimes and achieving state-of-the-art results on targeted search and summarization tasks (Nath et al., 13 Jan 2026).