Owen-Shapley Policy Optimization
- OSPO is a reinforcement learning framework for LLMs that assigns token-level credit to impactful text segments, addressing sparse reward challenges.
- It integrates potential-based reward shaping with Shapley-Owen attributions to precisely redistribute rewards and eliminate length bias.
- OSPO demonstrates enhanced sample efficiency and robustness, achieving up to 25% NDCG improvement over GRPO on benchmark generative search tasks.
Owen-Shapley Policy Optimization (OSPO) is a reinforcement learning (RL) framework designed for LLMs operating in generative search and personalized recommendation settings. OSPO addresses the token-level credit assignment gap inherent in sequence-level reward regimes by leveraging principled reward redistribution techniques rooted in Shapley-Owen attributions over semantically coherent text segments. This approach enables direct credit assignment to causally impactful segments of generated language, improving sample efficiency and robustness, particularly in domains with latent user intent and sparse reward signals (Nath et al., 13 Jan 2026).
1. Formal Foundation and Sequence-Level Credit Assignment
OSPO frames single-turn LLM response generation as an episodic Markov Decision Process (MDP):
- States: Partial sequences $s_t = (x, y_{<t})$, where $x$ is the input context and $y_{<t}$ the previously generated tokens.
- Actions: Selection of the next token $a_t = y_t$; transitions are deterministic: $s_{t+1} = (x, y_{\le t})$.
- Trajectories: Full generations $y = (y_1, \dots, y_T)$ sampled under policy $\pi_\theta(\cdot \mid x)$.
- Rewards: Sparse and terminal ($r_t = 0$ for $t < T$, $r_T = R(x, y)$) with discount factor $\gamma = 1$.
The RL objective is maximizing expected terminal reward:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ R(x, y) \right]$$

The gradient estimator utilizes a sequence-level advantage:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ A(x, y) \, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

with $A(x, y) = R(x, y) - b(x)$ for baseline $b(x)$.
Standard methods such as GRPO employ group-relative normalization at the sequence level, yielding $A_i = (R_i - \bar{R}) / \sigma_R$, where $\bar{R}$ is the group mean and $\sigma_R$ the group standard deviation. This creates a credit assignment gap: the RL signal does not resolve which subsequences specifically drive reward.
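As a concrete illustration, group-relative normalization can be sketched in a few lines (the function name and the variance-stabilizing `eps` are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize terminal rewards within a group of completions
    sampled for the same prompt; every token in completion i then
    shares the same scalar advantage A_i."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because every token in a completion inherits the same $A_i$, the signal cannot indicate which subsequence earned the reward — precisely the gap OSPO closes.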
2. Potential-Based Reward Shaping and Token-Level Redistribution
OSPO adopts potential-based reward shaping [Ng et al., 1999]. For any shaping function $\Phi$ over states:

$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma \, \Phi(s_{t+1}) - \Phi(s_t)$$

Optimal policies are invariant to $\Phi$ due to telescoping over episodes. In OSPO, segment- and token-level attributions act as reward-shaping potentials over partial sequences, assigning token-level credit without distorting policy optimality.
Redistribution preserves length invariance:

$$\hat{A}_{i,t} = T_i \, w_{i,t} \, A_i$$

with normalized weights $w_{i,t}$ summing to one, ensuring $\frac{1}{T_i} \sum_{t=1}^{T_i} \hat{A}_{i,t} = A_i$ (Lemma 1). This eliminates length bias in token-level RL signals.
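A minimal numeric check of this redistribution (the function name is illustrative): scaling normalized weights by the token count keeps the per-token mean equal to the sequence-level advantage regardless of generation length.

```python
import numpy as np

def redistribute(A_seq, weights):
    """Spread a sequence-level advantage over T tokens using attribution
    weights; after normalization, scaling by T preserves the per-token mean."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalize so weights sum to one
    return len(w) * w * A_seq    # hat A_t = T * w_t * A_seq

adv_short = redistribute(0.8, [0.1, 0.6, 0.3])
adv_long = redistribute(0.8, [0.1, 0.6, 0.3, 0.0, 0.0, 0.0])
# both average to 0.8: longer generations earn no extra per-token credit
```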
3. Shapley-Owen Attribution for Segment Credit Assignment
OSPO decomposes model outputs into $m$ semantically coherent segments $g_1, \dots, g_m$ (e.g., phrases or sentences) and forms coalitions for reward evaluation:
- Characteristic function: $v(S) = R\big(x, \mathrm{concat}(\{g_j : j \in S\})\big)$ for any coalition $S \subseteq N = \{1, \dots, m\}$.
- Classical Shapley value for segment $j$:

$$\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (m - |S| - 1)!}{m!} \big[ v(S \cup \{j\}) - v(S) \big]$$
- Owen value (contiguous subset restriction): marginal contributions are averaged only over contiguous coalitions $\mathcal{C}_j$ adjacent to segment $j$:

$$\phi_j^{\mathrm{Owen}} = \frac{1}{|\mathcal{C}_j|} \sum_{S \in \mathcal{C}_j} \big[ v(S \cup \{j\}) - v(S) \big]$$
The contiguous-span restriction lowers computational cost to $O(mW)$ reward evaluations, where $W$ is the maximum coalition width sampled per rollout, while preserving fairness axioms: efficiency ($\sum_{j=1}^{m} \phi_j = v(N)$), symmetry, and linearity.

Segment-level attributions $\phi = (\phi_1, \dots, \phi_m)$ are projected onto token-level attributions via a binary span-projection matrix $P \in \{0,1\}^{m \times T}$: $a = P^{\top} \phi$.
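A toy sketch of the segment attribution and span projection, under the assumption that the contiguous coalitions for segment $j$ are left- and right-adjacent spans of bounded width; the coalition enumeration here is illustrative, not the paper's exact sampling scheme.

```python
import numpy as np

def owen_contiguous(v, m, W):
    """Average marginal contributions of segment j over contiguous
    coalitions of width < W adjacent to j (so S u {j} stays contiguous).
    v maps a frozenset of segment indices to a scalar reward."""
    phi = np.zeros(m)
    for j in range(m):
        margins = [v(frozenset({j})) - v(frozenset())]  # empty coalition
        for w in range(1, W):
            for lo, hi in ((j - w, j), (j + 1, j + 1 + w)):  # left / right span
                if lo < 0 or hi > m:
                    continue
                S = frozenset(range(lo, hi))
                margins.append(v(S | {j}) - v(S))
        phi[j] = float(np.mean(margins))
    return phi

def project_to_tokens(phi, spans, T):
    """Binary span projection: token t inherits the attribution of the
    segment whose half-open span [a, b) contains it."""
    P = np.zeros((len(phi), T))
    for j, (a, b) in enumerate(spans):
        P[j, a:b] = 1.0
    return P.T @ phi

# Additive toy reward: each segment contributes a fixed amount, so the
# attribution recovers those amounts exactly.
wts = [1.0, 2.0, 3.0]
phi = owen_contiguous(lambda S: sum(wts[i] for i in S), m=3, W=2)
tokens = project_to_tokens(phi, spans=[(0, 2), (2, 3), (3, 5)], T=5)
```

For the additive reward above, every marginal contribution of segment $j$ equals its own weight, so the attributions are exact; non-additive rewards are where the coalition averaging matters.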
4. OSPO Optimization Objective and Workflow
For each prompt, $G$ completions are sampled. Sequence-level advantages are group-normalized:

$$\hat{A}_i = \frac{R_i - \bar{R}}{\sigma_R}$$

Token-level attribution weights $w_{i,t} = a_{i,t} / \sum_{t'} a_{i,t'}$ are obtained by normalizing the projected attributions, and the per-sequence advantage is redistributed:

$$\hat{A}_{i,t} = T_i \, w_{i,t} \, \hat{A}_i$$

The PPO-style surrogate objective accumulates clipped advantages:

$$\mathcal{L}(\theta) = \mathbb{E}\!\left[ \frac{1}{T_i} \sum_{t=1}^{T_i} \min\!\big( \rho_{i,t} \hat{A}_{i,t},\; \mathrm{clip}(\rho_{i,t},\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_{i,t} \big) \right]$$

with importance ratio $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$.
Pseudocode proceeds as follows: for each context $x$, sample $G$ generations, compute sequence rewards, segment the outputs, evaluate contiguous coalitions up to the per-rollout budget, calculate $\phi^{\mathrm{Owen}}$, project and normalize to tokens, redistribute the group-relative advantage, compute PPO-style ratios, and apply policy-gradient updates.
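The final update step can be sketched as a clipped surrogate over the redistributed token advantages (a minimal NumPy illustration; in practice these would be autograd tensors, and the function name is ours):

```python
import numpy as np

def ospo_surrogate(logp_new, logp_old, token_adv, eps=0.2):
    """PPO-style clipped surrogate averaged over a completion's tokens,
    using token-level OSPO advantages instead of one sequence scalar."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # importance ratios
    adv = np.asarray(token_adv, dtype=float)
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(rho * adv, clipped)))

# On-policy sanity check: identical log-probs give rho = 1, so the
# surrogate reduces to the mean token advantage.
loss = ospo_surrogate([-1.2, -0.7], [-1.2, -0.7], [0.5, -0.1])
```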
5. Theoretical Guarantees
- Policy Optimality: Potential-based shaping guarantees OSPO’s redistribution does not affect the set of optimal policies.
- Length Invariance: Lemma 1 states that redistributed token advantages average to the original sequence-level advantage, eliminating token count bias.
- Convergence: Under standard PPO assumptions (bounded ratios, clipped objective), OSPO inherits monotonic improvement and convergence guarantees as established by PPO frameworks.
6. Empirical Evaluation and Benchmark Results
Experiments on generative search and summarization tasks were conducted over:
- Amazon ESCI (shopping queries): Query expansion for dense retrieval (retrievers: FAISS + all-mpnet-base-v2).
- H&M Fashion:
- Contextualized product search (expert LLM-synthesized queries, retrieval via SIMCSE-Large).
- User profile summarization: Chain-of-Thought response, candidate ranking (reward: Bradley-Terry head + format reward).
Baselines include SFT (supervised finetuning), DPO (direct preference optimization), and GRPO (group-relative PPO).
Metrics:
| Task | Metric | OSPO-Prop (7B) | GRPO | Large Model (Qwen-2.5-72B) |
|---|---|---|---|---|
| ESCI Search | NDCG | 0.522 | 0.418 | 0.543 |
| H&M Fashion Search | NDCG | 0.436 | 0.379 | — |
| Summarization | Win-rate (pairwise, LLM judge) | 49–54% | — | — |
OSPO demonstrates a +25% relative improvement in NDCG over GRPO on ESCI, approaches metric parity with the much larger Qwen-2.5-72B model, and displays enhanced win-rates in summarization. Under out-of-distribution (OOD) retriever shifts (swapping between the all-mpnet and SIMCSE retrievers), OSPO-Prop retained superior relative performance compared to GRPO and offline baselines.
7. Practical Considerations and Insights
OSPO’s fine-grained token-level attributions via Owen-Shapley values yield several key benefits:
- Sample Efficiency & Interpretability: Direct assignment of credit to causally impactful segments in sparse-reward settings enhances learning and model transparency.
- Robustness & Anti-hacking: Concentrating gradient updates on utility-driving text segments reduces susceptibility to reward hacking and spurious pattern overfitting.
- Coalition Design: Performance is sensitive to the coalition width $W$ and the per-rollout coalition budget; moderate contiguous widths optimize the trade-off between variance reduction and context preservation.
- Computational Overhead: OSPO incurs $O(mW)$ extra reward evaluations per rollout (one per sampled coalition), which remains practical with fast retrieval-based rewards but more resource-intensive with slower reward models. Hyperparameter tuning is essential.
- Black-box Feedback Suitability: OSPO does not require a learned value network, integrating seamlessly with PPO-style clipping and group normalization, which is advantageous in environments where reward models or retrievers are closed-source.
A plausible implication is that OSPO’s segment- and token-level credit assignment paradigm may extend to broader reasoning tasks in language modeling where sequence-level rewards and latent user intent dominate.
In summary, Owen-Shapley Policy Optimization leverages potential-based reward shaping and semantically coherent segment attributions to deliver principled, interpretable, and sample-efficient RL for generative LLMs, bridging the granularity gap in sequence-level supervisory regimes and achieving state-of-the-art results on targeted search and summarization tasks (Nath et al., 13 Jan 2026).