Papers
Topics
Authors
Recent
Search
2000 character limit reached

G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance

Published 18 Aug 2025 in cs.AI | (2508.13023v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of LLMs. Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size LLMs (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G$2$RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$2$RPO-A substantially outperforms vanilla GRPO. Our code and models are available at https://github.com/T-Lab-CUHKSZ/G2RPO-A.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 5 tweets with 46 likes about this paper.