Papers
Topics
Authors
Recent
Search
2000 character limit reached

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Published 31 Mar 2026 in cs.AI | (2603.29871v1)

Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, LLMs often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.

Summary

  • The paper introduces a candidate-level reward allocation mechanism using Shapley values to resolve free-riding in multi-candidate LLM training.
  • It employs a polynomial-time computation (O(K^2)) to ensure unbiased gradient updates and improved convergence over traditional RLHF methods.
  • Empirical results across summarization, code suggestion, and recommendation tasks demonstrate enhanced model performance and training stability.

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Introduction and Motivation

The emergence of multi-candidate generation scenarios—such as recommendation, brainstorming, code suggestion, and summarization—has stressed the limitations of traditional RLHF protocols that assign scalar rewards at the sequence or set level. Methods like Group Relative Policy Optimization (GRPO) provide a single feedback signal for all candidates in a response, leading to noisy gradients and inefficient exploration. The lack of granular credit assignment enables uninformative candidates to "free-ride" on the reward obtained by high-utility candidates, which both slows down convergence and distorts the optimization landscape for LLMs acting as recommenders.

ShapE-GRPO introduces a principled allocation mechanism for set-level reward signals based on the Shapley value from cooperative game theory. This approach explicitly accounts for the permutation-invariant semantics of candidate sets (e.g., the set of movie recommendations is typically unordered), and provides unbiased and theoretically sound candidate-level reward assignments. The proposed framework also addresses computational tractability concerns by exploiting reward function structure, enabling exact Shapley value computation with O(K2)\mathcal{O}(K^2) complexity where KK is the number of candidates.

Methodology: Candidate-Level Shapley Value Decomposition

ShapE-GRPO models each candidate in the LLM-generated set as a "player" in a cooperative game, where the set-level reward (e.g., maximum user satisfaction over recommendations) forms the value of the coalition. The credit to each candidate is determined by its expected marginal contribution over all permutations—i.e., its Shapley value:

ϕi=SC{ci}S!(KS1)!K![R(S{ci})R(S)]\phi^i = \sum_{S \subseteq \mathcal{C} \setminus \{c^i\}} \frac{|S|! (K - |S| - 1)!}{K!}\left[R(S \cup \{c^i\}) - R(S)\right]

This allocation aligns with fundamental Shapley properties: efficiency, symmetry, additivity, and the null player axiom. Importantly, when the reward is permutation-invariant (e.g., R({c1,,cK})=maxiR(ci)R(\{c^1,\dots,c^K\}) = \max_i R(c^i) in click-through tasks), the candidate-level Shapley values are both semantically justified and efficiently computable. In practice, ShapE-GRPO broadcasts the candidate's Shapley value equally to its constituent tokens, while the non-candidate reasoning segment of the output receives the usual set-level reward.

This reward broadcast is then plugged into the RL post-training objective. Specifically, the advantage term for LLM policy optimization is reweighted by these candidate-level Shapley values, eliminating the free-rider problem endemic to GRPO and Winner-Takes-All baselines, and yielding more discriminative gradients. Under mild assumptions of candidate length balance, the total reward allocation remains unbiased relative to the set-level reward, preserving the global optimization trajectory.

Theoretical Properties

ShapE-GRPO maintains several key theoretical invariants:

  • Unbiasedness: If candidate tokens have matched lengths, the aggregate candidate-level Shapley rewards sum (after appropriate scaling) matches the original set-level reward; the optimization objective remains unbiased.
  • No Free-Riding: The null player property ensures that candidates making zero marginal contribution (e.g., a suggestion never clicked or a wrong code snippet) receive zero or negative reward signals. This design strictly prohibits positive reinforcement for non-contributory candidates.
  • Polynomial Complexity: For typical reward functions in LLM recommendation (max or binary, for instance), exact Shapley values are computed with only O(K2)\mathcal{O}(K^2) floating-point operations, compared to the exponential complexity in generic cooperative games.
  • Permutation Invariance: Rewards are robust to the ordering of candidates, aligning with practical deployments where the candidate set is unordered.

These properties collectively constitute a mathematically rigorous credit assignment strategy for post-training LLMs in set-valued feedback environments. Figure 1

Figure 1

Figure 2: (Left) Candidate length coefficient of variation (CV) distributions for summarization; (Right) Training curves on Netflix recommendations—ShapE-GRPO converges faster and with more stability than GRPO.

Empirical Evaluation

Experimental Design

Experiments are conducted on Qwen3-8B and Llama3.1-8B-Instruct across three core settings:

  • Aspect-Based Scientific Summarization (ACLSum): Multi-candidate summaries, reward from LLM-as-judge protocol (GPT-5.2) normalized to [0,1][0, 1].
  • Data Science Code Suggestion (DS-1000): Four candidate solutions per prompt, binary reward (success if any passes tests).
  • Movie Recommendation (Netflix): Four candidate recommendations per user, explicit user history as context, reward is ground-truth rating.

Ablations include comparison to GRPO and Winner-Takes-All reward assignment. For Netflix, variants with and without user history are included. Evaluation measures are reward averages on held-out sets. Figure 3

Figure 3

Figure 3

Figure 3

Figure 1: Response reward as a function of the number of candidates across datasets using Qwen3-8B, illustrating diminishing returns as KK increases.

Main Results

ShapE-GRPO yields consistent advancement over both standard GRPO and WTA across all tasks. Notably:

  • ACLSum: ShapE-GRPO improves from 0.658 (GRPO) to 0.672; Winner-Takes-All actually falls below the baseline (0.642).
  • DS-1000: From 0.465 (GRPO) to 0.510 (ShapE-GRPO).
  • Netflix w/ history: Substantial lift from 3.710 (GRPO) to 4.141 (ShapE-GRPO) on the normalized rating scale.
  • Training Dynamics: On Netflix, ShapE-GRPO converges in about 160 steps (compared to 240 for GRPO), and exhibits much lower reward variance across checkpoints (see Figure 2, right).

Such empirical findings robustly support the claim that fine-grained, coalition-aware reward assignment at the candidate level both accelerates post-training convergence and yields stronger, more stable models for real-world RLHF tasks with set-valued feedback.

Theoretical and Practical Implications

The adoption of ShapE-GRPO carries several downstream consequences:

  • RLHF for Set-Level Metrics: Many real-world tasks (shopping, movie recommendation, brainstorming, code suggestion) benefit from set-level feedback, not sequence-level. Traditional pipelines are sample-inefficient and bias-prone in these domains; ShapE-GRPO closes this gap.
  • Diversity and Exploration: By aligning the reward with marginal utility, the algorithm implicitly encourages diverse candidate outputs; adding a redundant candidate yields little or negative reward, guiding the model to expand coverage.
  • Computational Scalability: The polynomial runtime for candidate-level Shapley computation means ShapE-GRPO is deployable for online and large-scale scenarios—this is not generally true for token-wise Shapley allocation.
  • Generalization: Experiments on models with and without access to user history underline difficulties with long-context utilization. Methods like ShapE-GRPO may synergize with architectural enhancements for context compression or retrieval-augmented generation.
  • Axiomatic Soundness: Strong theoretical guarantees prevent reward allocation pathologies (free-riding, reward hacking), making the method principled for practical integration into RLHF, RL-augmented recommendation, and ideation systems.

Future Directions

Open research questions include exploration of alternative coalition value measures (e.g., instrumental value), application to unsupervised and gradient-based reward estimation protocols, and extension to scenarios with partial or noisy reward models. Analyzing interaction with techniques for long-context understanding and inference within LLMs is also of interest.

Conclusion

ShapE-GRPO provides a mathematically principled, computationally efficient, and empirically validated mechanism for reward allocation in multi-candidate LLM post-training (2603.29871). By exploiting cooperative game theory and the permutation-invariant structure of set-level feedback, it addresses core deficiencies of sequence-level RLHF and establishes a new baseline for efficient, robust, and scalable LLM-as-a-recommender pipelines. These findings motivate broader adoption of coalition-aware reward design and open up new lines of inquiry on advanced credit assignment for RL-based training of generative models.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.