
Owen-Shapley Policy Optimization

Updated 20 January 2026
  • OSPO is a reinforcement learning framework for LLMs that assigns token-level credit to impactful text segments, addressing sparse reward challenges.
  • It integrates potential-based reward shaping with Shapley-Owen attributions to precisely redistribute rewards and eliminate length bias.
  • OSPO demonstrates enhanced sample efficiency and robustness, achieving up to 25% NDCG improvement over GRPO on benchmark generative search tasks.

Owen-Shapley Policy Optimization (OSPO) is a reinforcement learning (RL) framework designed for LLMs operating in generative search and personalized recommendation settings. OSPO addresses the token-level credit assignment gap inherent in sequence-level reward regimes by leveraging principled reward redistribution techniques rooted in Shapley-Owen attributions over semantically coherent text segments. This approach enables direct credit assignment to causally impactful segments of generated language, improving sample efficiency and robustness, particularly in domains with latent user intent and sparse reward signals (Nath et al., 13 Jan 2026).

1. Formal Foundation and Sequence-Level Credit Assignment

OSPO frames single-turn LLM response generation as an episodic Markov Decision Process (MDP):

  • States $s_t \in \mathcal{S}$: Partial sequences $(x, y_{<t})$, where $x$ is the input context and $y_{<t} = (y_1, \ldots, y_{t-1})$ are the previously generated tokens.
  • Actions $a_t \in \mathcal{A}$: Selection of the next token $y_t$; transitions are deterministic: $s_{t+1} = (x, y_{\le t})$.
  • Trajectories $\tau = (s_0, a_0, \ldots, s_T, a_T)$: Correspond to full generations $y = (y_1, \ldots, y_T)$ sampled under policy $\pi_\theta$.
  • Rewards: Sparse and terminal ($r_t = 0$ for $t < T$, $r_T = R(\tau) = r(x, y)$), with discount factor $\gamma = 1$.

The RL objective is maximizing expected terminal reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

The gradient estimator uses the sequence-level advantage:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A(\tau) \right]$$

with $A(\tau) = R(\tau) - b$ for a baseline $b$.

Standard methods such as GRPO employ group-relative normalization at the sequence level, yielding $A(\tau) = R(\tau) - \bar{R}$, where $\bar{R}$ is the group mean. This creates a credit assignment gap: the RL signal does not resolve which subsequences specifically drive reward.
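
The following sketch illustrates this gap under group-relative normalization; the array shapes, helper name, and the standard-deviation scaling (used later by OSPO) are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def grpo_sequence_advantages(rewards):
    """Group-relative advantages: one scalar per completion in the group.

    rewards: shape (G,) terminal rewards R^(g) for G completions sampled
    from the same prompt.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four completions for one prompt, with different lengths T_g.
rewards = [0.9, 0.2, 0.5, 0.4]
adv = grpo_sequence_advantages(rewards)

# Sequence-level credit assignment: every token in completion g receives the
# same advantage, so the gradient cannot tell which segment earned the reward.
lengths = [12, 30, 18, 25]
token_advantages = [np.full(T, a) for T, a in zip(lengths, adv)]
print(adv)                      # one advantage per completion
print(token_advantages[0][:5])  # the same value repeated across tokens
```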

2. Potential-Based Reward Shaping and Token-Level Redistribution

OSPO adopts potential-based reward shaping [Ng et al., 1999]. For any shaping function $\phi(s)$ over states:

$$\hat{r}(s, a, s') = r(s, a) + \phi(s') - \phi(s)$$

Optimal policies are invariant to $\phi$ because the shaping terms telescope over an episode. In OSPO, segment- and token-level attributions $\phi_t$ act as reward-shaping potentials over partial sequences, assigning token-level credit without distorting policy optimality.
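
A one-line derivation makes the telescoping argument explicit (with $\gamma = 1$ and, by the usual convention, $\phi$ taken to vanish at terminal states):

$$\sum_{t=0}^{T-1} \hat{r}(s_t, a_t, s_{t+1}) = \sum_{t=0}^{T-1} \big[ r(s_t, a_t) + \phi(s_{t+1}) - \phi(s_t) \big] = R(\tau) + \phi(s_T) - \phi(s_0) = R(\tau) - \phi(s_0),$$

so the shaped return differs from the original return only by $\phi(s_0)$, which depends on the prompt alone, leaving the ranking of policies (and hence the optimal policy set) unchanged.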

Redistribution preserves length invariance:

$$A_t^{(g)} = T \cdot \tilde{\phi}_t^{(g)} \cdot \hat{A}^{(g)}$$

with the normalized attributions $\tilde{\phi}_t$ summing to one, ensuring $(1/T) \sum_t A_t^{(g)} = \hat{A}^{(g)}$ (Lemma 1). This eliminates length bias in token-level RL signals.
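
A minimal sketch of this redistribution step, assuming per-token attributions are already available; the function name, the epsilon guarding the normalization, and the uniform fallback for all-zero attributions are illustrative choices.

```python
import numpy as np

def redistribute_advantage(seq_advantage, token_attributions, eps=1e-8):
    """Spread a sequence-level advantage over tokens via normalized attributions.

    seq_advantage: the sequence-level advantage A_hat^(g) for one completion.
    token_attributions: shape (T,) per-token attributions phi_t (e.g. Owen
    values projected onto tokens). They are normalized to sum to one, then
    scaled by T so the per-token advantages average back to seq_advantage.
    """
    phi = np.asarray(token_attributions, dtype=np.float64)
    total = phi.sum()
    if abs(total) < eps:                       # degenerate case: no signal, spread uniformly
        phi_tilde = np.full_like(phi, 1.0 / len(phi))
    else:
        phi_tilde = phi / total                # normalized attributions, sum to one
    T = len(phi_tilde)
    token_adv = T * phi_tilde * seq_advantage  # A_t = T * phi_tilde_t * A_hat

    # Lemma 1 (length invariance): the mean token advantage recovers A_hat.
    assert np.isclose(token_adv.mean(), seq_advantage)
    return token_adv

adv_tokens = redistribute_advantage(0.8, [0.1, 0.0, 0.6, 0.3])
print(adv_tokens, adv_tokens.mean())  # mean = 0.8
```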

3. Shapley-Owen Attribution for Segment Credit Assignment

OSPO decomposes model outputs into $N$ semantically coherent segments $A = \{a_1, \ldots, a_N\}$ (e.g., phrases or sentences), and forms coalitions for reward evaluation:

  • Characteristic function: $v(S) = r(x, \operatorname{concat}_{j \in S} a_j)$ for any coalition $S \subseteq \{1, \ldots, N\}$.
  • Classical Shapley value for segment $i$:

$$\phi_i = \sum_{S \subseteq \{1, \ldots, N\} \setminus \{i\}} \frac{|S|!\,(N - |S| - 1)!}{N!}\,\big[ v(S \cup \{i\}) - v(S) \big]$$

  • Owen value (contiguous subset restriction):

$$C_i = \{\, S : S \text{ contiguous},\ i \notin S,\ S \cup \{i\} \text{ contiguous} \,\}$$

$$\phi_i^{\mathrm{Owen}} = \frac{1}{|C_i|} \sum_{S \in C_i} \big[ v(S \cup \{i\}) - v(S) \big]$$

The contiguous-span restriction lowers computational cost to $O(N \cdot w_{\max})$, where $w_{\max}$ is the maximum coalition width sampled per rollout, while preserving fairness axioms: efficiency ($\sum_i \phi_i = R$), symmetry, and linearity.
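
A sketch of this contiguous-coalition attribution; the characteristic function `v` is treated as a black-box reward call, and enumerating all spans up to width `w_max` on either side of a segment is one straightforward reading of the $O(N \cdot w_{\max})$ bound, not necessarily the paper's exact sampling scheme.

```python
def owen_values(v, N, w_max):
    """Contiguous-coalition (Owen-style) attribution for N ordered segments.

    v: callable mapping a tuple of segment indices to a scalar reward,
       e.g. v(S) = r(x, concatenation of the segments in S); v(()) is the
       empty coalition. N: number of segments; w_max: maximum coalition width.
    Returns N averaged marginal contributions, one per segment.
    """
    phi = [0.0] * N
    for i in range(N):
        # Contiguous S not containing i such that S ∪ {i} is also contiguous:
        # S is a run of segments immediately left or right of i (or empty).
        spans = [()]
        for w in range(1, w_max + 1):
            left = tuple(range(i - w, i))
            right = tuple(range(i + 1, i + 1 + w))
            if left and left[0] >= 0:
                spans.append(left)
            if right and right[-1] < N:
                spans.append(right)
        contributions = [v(tuple(sorted(S + (i,)))) - v(S) for S in spans]
        phi[i] = sum(contributions) / len(contributions)
    return phi

# Toy characteristic function: reward counts how many "useful" segments survive.
useful = {1, 3}
toy_v = lambda S: float(len(useful.intersection(S)))
print(owen_values(toy_v, N=5, w_max=2))  # segments 1 and 3 receive the credit
```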

Segment-level $\phi_i$ is projected onto token-level attributions $\phi_t$ via a binary span-projection matrix $W \in \{0,1\}^{N \times T}$: $\phi_{\mathrm{tok}} = W^\top \phi_{\mathrm{seg}}$.
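
The projection itself is a single matrix product; the sketch below assumes segment boundaries are available as token index ranges (names are hypothetical).

```python
import numpy as np

def project_to_tokens(phi_seg, segment_spans, T):
    """Build W in {0,1}^{N x T} from token spans and compute phi_tok = W^T phi_seg.

    segment_spans: list of (start, end) token index pairs (end exclusive),
    one per segment; T: number of tokens in the completion.
    """
    N = len(phi_seg)
    W = np.zeros((N, T))
    for i, (start, end) in enumerate(segment_spans):
        W[i, start:end] = 1.0           # tokens covered by segment i
    return W.T @ np.asarray(phi_seg)    # shape (T,): per-token attributions

phi_tok = project_to_tokens([0.0, 1.0, 0.2], [(0, 3), (3, 7), (7, 10)], T=10)
print(phi_tok)  # each token inherits its segment's Owen value
```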

4. OSPO Optimization Objective and Workflow

For each prompt, $G$ completions are sampled. Sequence-level advantages are group-normalized:

$$\hat{A}^{(g)} = \frac{R^{(g)} - \bar{R}}{\sigma_R}$$

Token-level advantages are obtained by redistributing the per-sequence advantage according to the normalized attributions:

$$A_t^{(g)} = T \cdot \tilde{\phi}_t^{(g)} \cdot \hat{A}^{(g)}$$

The PPO-style surrogate objective accumulates clipped advantages:

$$J^{\mathrm{OSPO}}(\theta) = \frac{1}{G} \sum_{g=1}^{G} \frac{1}{T} \sum_{t=1}^{T} \min\!\left( \rho_t^{(g)} A_t^{(g)},\ \operatorname{clip}\!\big(\rho_t^{(g)}, 1-\epsilon, 1+\epsilon\big) A_t^{(g)} \right)$$

with importance ratio $\rho_t^{(g)} = \dfrac{\pi_\theta(y_t^{(g)} \mid s_t^{(g)})}{\pi_{\theta_{\mathrm{old}}}(y_t^{(g)} \mid s_t^{(g)})}$, where $s_t^{(g)} = (x, y_{<t}^{(g)})$ is the partial-sequence state defined above.
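
A direct translation of the clipped surrogate into array code, with the redistributed token-level advantages in place of a single sequence-level value; tensor names and shapes are assumptions for illustration, and variable-length completions would additionally need a padding mask.

```python
import torch

def ospo_surrogate(logp_new, logp_old, token_adv, eps=0.2):
    """PPO-style clipped objective with token-level (redistributed) advantages.

    logp_new, logp_old: log-probabilities of the sampled tokens under the
    current and behavior policies, shape (G, T); token_adv: A_t^{(g)} of the
    same shape. Returns the scalar objective (maximize; negate for a loss).
    """
    ratio = torch.exp(logp_new - logp_old.detach())           # rho_t^{(g)}
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * token_adv
    return torch.minimum(unclipped, clipped).mean()           # (1/G)(1/T) double sum

G, T = 4, 16
logp_old = torch.randn(G, T)
logp_new = (logp_old + 0.01 * torch.randn(G, T)).requires_grad_()
token_adv = torch.randn(G, T)
print(ospo_surrogate(logp_new, logp_old, token_adv))
```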

Pseudocode proceeds as follows: for each context $x$, sample $G$ generations, compute sequence rewards, segment the outputs, evaluate up to $M$ contiguous coalitions, calculate $\phi_i^{\mathrm{Owen}}$, project and normalize to tokens, redistribute the group-relative advantage, compute PPO-style ratios, and apply policy gradient updates.
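
Putting the pieces together, one pass over these steps for a single prompt might look like the skeleton below. It reuses the helper sketches above (`owen_values`, `project_to_tokens`, `redistribute_advantage`) together with toy stand-ins for the reward model and segmenter; all names are hypothetical, and a real system would batch prompts and feed the result into the clipped surrogate.

```python
import numpy as np

# Toy stand-ins (hypothetical): a lexical-overlap "reward" and a word-level segmenter.
def reward(x, text):
    return float(sum(w in text for w in x.split()))

def segment(y):
    words = y.split()
    return words, [(i, i + 1) for i in range(len(words))]  # one token per segment

def ospo_token_advantages(x, completions, w_max=2):
    """Pseudocode steps: rewards -> group advantages -> Owen values ->
    token projection -> redistributed token-level advantages."""
    rewards = np.array([reward(x, y) for y in completions])
    adv_seq = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    batch = []
    for y, a_hat in zip(completions, adv_seq):
        segs, spans = segment(y)
        v = lambda S: reward(x, " ".join(segs[j] for j in S))     # coalition reward
        phi_seg = owen_values(v, len(segs), w_max)                # contiguous Owen values
        phi_tok = project_to_tokens(phi_seg, spans, T=len(segs))  # W^T phi_seg
        batch.append(redistribute_advantage(a_hat, phi_tok))      # token advantages
    return batch  # fed into the PPO-style clipped update

adv = ospo_token_advantages("red dress summer",
                            ["red dress for summer", "blue winter coat sale"])
print([a.round(2) for a in adv])
```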

5. Theoretical Guarantees

  • Policy Optimality: Potential-based shaping guarantees OSPO’s redistribution does not affect the set of optimal policies.
  • Length Invariance: Lemma 1 states that redistributed token advantages average to the original sequence-level advantage, eliminating token count bias.
  • Convergence: Under standard PPO assumptions (bounded ratios, clipped objective), OSPO inherits monotonic improvement and convergence guarantees as established by PPO frameworks.

6. Empirical Evaluation and Benchmark Results

Experiments on generative search and summarization tasks were conducted over:

  • Amazon ESCI (shopping queries): Query expansion for dense retrieval (retrievers: FAISS + all-mpnet-base-v2).
  • H&M Fashion:
    • Contextualized product search (expert LLM-synthesized queries, retrieval via SIMCSE-Large).
    • User profile summarization: Chain-of-Thought response, candidate ranking (reward: Bradley-Terry head + format reward).

Baselines include SFT (supervised fine-tuning), DPO (direct preference optimization), and GRPO (group-relative policy optimization).

Metrics:

| Task | Metric | OSPO-Prop (7B) | GRPO | Large Model (Qwen-2.5-72B) |
|---|---|---|---|---|
| ESCI search | NDCG | 0.522 | 0.418 | 0.543 |
| H&M Fashion search | NDCG | 0.436 | 0.379 | — |
| Summarization | Win-rate (pairwise, LLM judge) | 49–54% | — | — |

OSPO demonstrates a +25% relative improvement in NDCG over GRPO on ESCI, approaches metric parity with the much larger Qwen-2.5-72B model, and displays enhanced win-rates in summarization. Under out-of-distribution (OOD) retriever shifts (all-mpnet ↔ SIMCSE), OSPO-Prop retains superior relative performance compared to GRPO and the offline baselines.

7. Practical Considerations and Insights

OSPO’s fine-grained token-level attributions via Owen-Shapley values yield several key benefits:

  • Sample Efficiency & Interpretability: Direct assignment of credit to causally impactful segments in sparse-reward settings enhances learning and model transparency.
  • Robustness & Anti-hacking: Concentrating gradient updates on utility-driving text segments reduces susceptibility to reward hacking and spurious pattern overfitting.
  • Coalition Design: Performance is sensitive to coalition width ($w_{\max}$) and budget ($M$); moderate contiguous widths ($w \approx 4$–$8$, $M \approx 32$–$96$) optimize the trade-off between variance reduction and context preservation.
  • Computational Overhead: OSPO incurs $M$ extra reward evaluations per rollout, remaining practical with fast retrieval-based rewards but more resource-intensive with slower reward models. Hyperparameter tuning is essential.
  • Black-box Feedback Suitability: OSPO does not require a learned value network, integrating seamlessly with PPO-style clipping and group normalization, which is advantageous in environments where reward models or retrievers are closed-source.

A plausible implication is that OSPO’s segment- and token-level credit assignment paradigm may extend to broader reasoning tasks in language modeling where sequence-level rewards and latent user intent dominate.


In summary, Owen-Shapley Policy Optimization leverages potential-based reward shaping and semantically coherent segment attributions to deliver principled, interpretable, and sample-efficient RL for generative LLMs, bridging the granularity gap in sequence-level supervisory regimes and achieving state-of-the-art results on targeted search and summarization tasks (Nath et al., 13 Jan 2026).
