Unsupervised Candidate Scoring Without Oracle Reward Models

Determine how unsupervised candidate-scoring methods (for example, gradient-based approaches such as Prismatic Synthesis) perform when evaluating individual candidates in multi-candidate large language model generation, in settings where an oracle reward model is unavailable.

Background

ShapE-GRPO allocates set-level rewards to individual candidates using candidate-level Shapley values, which presupposes access to an oracle reward function capable of scoring candidates. In practical deployments, such a reward model may be unavailable or costly to obtain.
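As a concrete illustration (a minimal sketch, not the paper's implementation), exact Shapley allocation over a small candidate set can be written directly from the definition: each candidate's value is its average marginal contribution to the set-level reward across all subsets of the other candidates. The set-level reward below is a hypothetical toy (the best candidate score in the subset), standing in for the oracle reward function:

```python
from itertools import combinations
from math import factorial

def shapley_allocation(candidates, set_reward):
    """Exact Shapley values: average marginal contribution of each
    candidate to the set-level reward, over all subsets of the rest.
    Exponential in the candidate count, so only viable for small sets."""
    n = len(candidates)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):  # k = size of the subset S not containing i
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = set_reward([candidates[j] for j in subset] + [candidates[i]])
                without_i = set_reward([candidates[j] for j in subset])
                phi += weight * (with_i - without_i)
        values.append(phi)
    return values

# Toy set-level reward (hypothetical): the best candidate score in the set.
scores = {"a": 0.2, "b": 0.9, "c": 0.5}
reward = lambda subset: max((scores[c] for c in subset), default=0.0)
phis = shapley_allocation(list(scores), reward)
```

By the efficiency property, the per-candidate values sum to the full set's reward (0.9 here), and the strongest candidate receives the largest share.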

The authors note that unsupervised approaches, particularly gradient-based techniques, might evaluate candidates without a reward model, but their effectiveness in this role has not been established. Understanding how well such unsupervised scoring performs is necessary before candidate-level allocation can be applied in settings that lack an explicit reward model.
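One speculative form such a gradient-based signal could take (a sketch under stated assumptions, not the cited Prismatic Synthesis method): score each candidate by how well its per-candidate loss gradient aligns with the mean gradient of the whole set, so consensus candidates score high and outliers low. The gradient vectors below are hypothetical placeholders for per-candidate backprop through the policy, which is omitted here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def gradient_agreement_scores(grads):
    """Unsupervised proxy score (hypothetical): each candidate's
    alignment with the mean gradient of the candidate set."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[d] for g in grads) / n for d in range(dim)]
    return [cosine(g, mean) for g in grads]

# Toy per-candidate gradient vectors (in practice these would come from
# backpropagating each candidate's NLL through the policy).
grads = [
    [1.0, 0.9, 1.1],     # agrees with the consensus direction
    [0.9, 1.0, 1.0],     # agrees
    [-1.0, -0.8, -1.2],  # points the opposite way: outlier
]
agreement = gradient_agreement_scores(grads)
```

Whether alignment-style scores like this correlate with oracle rewards is exactly the open question the authors raise.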

References

Another interesting direction concerns the scoring of candidates: when an oracle reward model is unavailable, it remains unclear how unsupervised approaches, e.g., gradient-based~\citep{jung2025prismatic}, would perform for candidate evaluation.

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training  (2603.29871 - Ai et al., 31 Mar 2026) in Section 6: Conclusion and Discussion