Unsupervised Candidate Scoring Without Oracle Reward Models
Determine how well unsupervised candidate-scoring methods, for example gradient-based approaches such as Prismatic Synthesis, evaluate individual candidates in multi-candidate large language model generation when no oracle reward model is available.
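To make the question concrete, the following is a minimal sketch of one hypothetical way to score candidates without any reward model. It assumes each candidate has already been mapped to a feature vector (for instance, a projected gradient from a small proxy model; that step is not shown), and it credits each candidate with its leave-one-out contribution to a Vendi-style diversity score under a cosine kernel. The function names `vendi_score` and `leave_one_out_scores`, the kernel choice, and the leave-one-out attribution are all illustrative assumptions, not the method of either cited paper. Whether a score of this kind tracks what an oracle reward model would assign is exactly what remains open.

```python
import numpy as np


def vendi_score(features):
    """Vendi-style diversity of row vectors under a cosine-similarity kernel.

    Returns the exponential of the Shannon entropy of the eigenvalues of K / n,
    where K is the n x n cosine-similarity matrix of the rows.
    """
    x = np.asarray(features, dtype=float)
    # Normalize rows so that x @ x.T is a cosine-similarity matrix.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    k = x @ x.T / x.shape[0]
    eig = np.linalg.eigvalsh(k)
    # Drop zero (and tiny negative) eigenvalues; they contribute 0 to the entropy.
    eig = eig[eig > 1e-12]
    return float(np.exp(-np.sum(eig * np.log(eig))))


def leave_one_out_scores(features):
    """Hypothetical per-candidate score: drop in diversity when a candidate is removed.

    Assumes at least two candidates. `features` stands in for per-candidate
    gradient vectors, which this sketch does not compute.
    """
    x = np.asarray(features, dtype=float)
    full = vendi_score(x)
    return np.array(
        [full - vendi_score(np.delete(x, i, axis=0)) for i in range(x.shape[0])]
    )


if __name__ == "__main__":
    # Toy stand-ins for gradient features: candidates 0 and 1 are duplicates.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(leave_one_out_scores(toy))
```

In the toy input, the third candidate is the only one pointing in a new direction, so it receives the highest score, while the two duplicates receive equal and lower scores. Such a score rewards novelty only; it carries no signal about correctness or quality, which is one reason its agreement with an oracle reward is uncertain.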
References
Another interesting direction concerns the scoring of candidates: when an oracle reward model is unavailable, it remains unclear how unsupervised approaches, e.g., gradient-based (Jung et al., 2025), would perform for candidate evaluation.
— ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
(2603.29871 - Ai et al., 31 Mar 2026) in Section 6: Conclusion and Discussion