
Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

Published 15 Aug 2025 in cs.LG and cs.AI | arXiv:2508.11800v1

Abstract: Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of LLMs in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing LLMs in verifiable domains with stochastic outcomes, like scientific experiments. Through applications to synthetic data and real-world biological experiments, we demonstrate that Group Relative Policy Optimization (GRPO) induces overconfident probability predictions for binary stochastic outcomes, while Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard normalization in GRPO fixes its miscalibration and provide a theoretical explanation for why normalization causes overconfidence. Our results provide new evidence against the use of standard normalization in GRPO and help pave the way for applications of RL for reasoning LLMs beyond deterministic domains.
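
The fix the abstract describes, removing group standard normalization, is easy to make concrete. Below is a minimal sketch of how GRPO-style group-relative advantages are typically computed for a group of binary rewards; the function name, the epsilon constant, and the example group are illustrative assumptions rather than code from the paper.

```python
# Minimal sketch of GRPO-style group-relative advantages for binary
# rewards. Function name, epsilon, and the example group are
# illustrative assumptions, not code from the paper.
import numpy as np

def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Advantages for one prompt's group of G sampled rollouts.

    normalize_std=True  -> standard GRPO: (r - mean) / (std + eps)
    normalize_std=False -> mean-centering only, the variant the paper
                           reports as well-calibrated
    """
    adv = rewards - rewards.mean()
    if normalize_std:
        adv = adv / (rewards.std() + eps)
    return adv

# A group of 10 rollouts from a stochastic experiment that succeeds
# about 90% of the time: nine 1s and one 0.
rewards = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 0.])

print("mean-centered: ", group_advantages(rewards, normalize_std=False))
print("std-normalized:", group_advantages(rewards, normalize_std=True))
# mean-centered:  each success +0.1, the failure -0.9
# std-normalized: each success ~+0.33, the failure -3.0
# (group std = sqrt(0.9 * 0.1) = 0.3, so dividing rescales every
#  advantage by 1/0.3; near-unanimous groups get rescaled even more)
```

Because the group standard deviation of binary outcomes is sqrt(p(1-p)) for a group success rate p, it shrinks as p approaches 0 or 1, so the std-normalized advantages are amplified exactly where outcomes are nearly deterministic. This rescaling is the group standard normalization the abstract identifies as the source of the miscalibration; dropping it leaves the plain mean-centered signal. For the precise theoretical argument linking this to overconfidence, see the paper itself.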


