A Survey of Reinforcement Learning for Large Reasoning Models

Published 10 Sep 2025 in cs.CL, cs.AI, and cs.LG | (2509.08827v1)

Abstract: In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with LLMs. RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs


Summary

The paper surveys how reinforcement learning with verifiable rewards turns large language models into reasoning models that can tackle complex tasks such as math and coding.
It formalizes the RL framework for language models: the state is the prompt plus the tokens generated so far, the action is the next token, the policy is the model itself, and the reward is typically sequence-level correctness.
It reports that critic-free methods, dynamic sampling strategies, and hybrid reward designs support scalable and efficient training of advanced reasoning capabilities.

Introduction and Motivation

Reinforcement Learning (RL) has become a central methodology for enhancing the reasoning capabilities of LLMs, transforming them into Large Reasoning Models (LRMs). The surveyed work provides a comprehensive synthesis of the field, focusing on the application of RL to LLMs for complex logical tasks, such as mathematics, coding, and agentic behaviors. The survey emphasizes the transition from RL for human alignment (e.g., RLHF, DPO) to RL with verifiable rewards (RLVR), highlighting the emergence of new scaling axes, particularly the allocation of test-time compute and the explicit incentivization of reasoning processes.

Figure 1: Overview of the survey, highlighting foundational RL components for LRMs, open problems, resources, and applications, with a focus on large-scale agent-environment interactions and long-term evolution.

Evolution of RL for LLMs and LRMs

The field has evolved from RLHF and DPO, which primarily address human alignment, to RLVR, which leverages verifiable, programmatic rewards for tasks with objective ground truth (e.g., math, code). This shift is exemplified by models such as OpenAI o1 and DeepSeek-R1, which demonstrate that RL with verifiable rewards can induce sophisticated, long-form reasoning behaviors, including planning, reflection, and self-correction. The survey documents a timeline of both open-source and proprietary models, showing rapid progress in reasoning benchmarks and the expansion of RL to multimodal and agentic domains.

Figure 2: RLHF and DPO as dominant alignment methods, with RLVR emerging as a key trend for complex task-solving in LRMs; open-ended RL is identified as a major future challenge.

Figure 3: Timeline of representative reasoning models trained with RL, spanning language, multimodal, and agentic models.

RL Formulation for LLMs

The survey formalizes the mapping of RL concepts to the language modeling domain:

State: Prompt plus generated tokens so far.
Action: Next token selection.
Policy: The LLM itself.
Reward: Typically assigned at the sequence level (e.g., correctness), but can be decomposed to token, step, or turn levels for denser feedback.
Transition: Deterministic concatenation of tokens.

This MDP formulation enables the direct application of RL algorithms to LLMs, with the learning objective being the maximization of expected reward over the data distribution.
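
The mapping above can be made concrete with a small sketch. This is an illustrative toy, not code from the survey: the names `State`, `transition`, `rollout`, `toy_policy`, and `toy_reward` are hypothetical, tokens are plain strings, and the "policy" is a fixed script standing in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class State:
    """State = prompt tokens plus the tokens generated so far."""
    prompt: Tuple[str, ...]
    generated: Tuple[str, ...] = ()


def transition(state: State, action: str) -> State:
    """Deterministic transition: append the chosen token to the context."""
    return State(state.prompt, state.generated + (action,))


def rollout(
    policy: Callable[[State], str],
    prompt: Sequence[str],
    reward_fn: Callable[[State], float],
    eos: str = "<eos>",
    max_tokens: int = 16,
) -> Tuple[State, float]:
    """Generate one response and score it once, at the sequence level."""
    state = State(tuple(prompt))
    for _ in range(max_tokens):
        action = policy(state)              # action = next token
        state = transition(state, action)
        if action == eos:
            break
    return state, reward_fn(state)


def toy_policy(state: State) -> str:
    """Hypothetical stand-in for an LLM: emits a fixed two-token answer."""
    script = ("4", "<eos>")
    return script[min(len(state.generated), len(script) - 1)]


def toy_reward(state: State) -> float:
    """Hypothetical sequence-level reward: 1.0 if the answer token appears."""
    return 1.0 if "4" in state.generated else 0.0
```

Calling `rollout(toy_policy, ["2", "+", "2", "="], toy_reward)` walks through the loop once; in a real setup the policy would sample from a model's next-token distribution and the reward would come from a checker or reward model.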

Figure 4: RL and LMs as agents—tokens as actions, context as state, and rewards typically at the response level.

Foundational Components

Reward Design

The survey provides a taxonomy of reward types:

Verifiable Rewards: Rule-based, programmatic checkers for math/code; highly scalable and more resistant to reward hacking than learned reward models.
Generative Rewards: Model-based verifiers and reward models (GenRMs) for subjective or non-verifiable tasks, including rubric-based and co-evolving systems.
Dense Rewards: Token-, step-, and turn-level rewards for improved credit assignment and sample efficiency.
Unsupervised Rewards: Model-specific (e.g., consistency, confidence, self-rewarding) and model-agnostic (e.g., heuristic, data-centric) approaches to bypass human annotation bottlenecks.
Reward Shaping: Combination of rule-based and model-based signals, group baselines, and Pass@K-aligned objectives to stabilize and align training.

The survey highlights the critical role of verifiable rewards in scaling RL for reasoning, while generative and dense rewards are essential for subjective or process-oriented tasks.
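
As a rough illustration of a rule-based verifiable reward, the sketch below scores a math response by exact match on its final boxed answer. The `\boxed{...}` convention, the function names, and the exact-match rule are assumptions made for this example; practical checkers usually normalize expressions more carefully.

```python
import re
from typing import Optional

# Matches a literal \boxed{...} with no nested braces (an assumed answer format).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last boxed expression, or None if absent."""
    matches = _BOXED.findall(response)
    return matches[-1].strip() if matches else None


def math_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 for an exact match, otherwise 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Because the check is a deterministic rule rather than a learned model, it is cheap to run at scale, but it only applies where a ground-truth answer exists.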

Policy Optimization

RL for LLMs is dominated by first-order, gradient-based algorithms:

Critic-based: PPO and variants, using value models for token-level advantage estimation; effective but computationally intensive.
Critic-free: REINFORCE, GRPO, and derivatives, using sequence-level or group-normalized advantages; more scalable for RLVR tasks.
Off-policy: Methods leveraging historical or offline data, importance sampling, and hybrid SFT+RL objectives for improved sample efficiency.
Regularization: KL and entropy regularization to balance exploration and stability; length penalties to control reasoning cost.

The survey notes that critic-free methods, especially GRPO and its variants, have become the de facto standard for large-scale RLVR due to their simplicity and scalability.
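
The group-normalized advantage used by GRPO-style methods can be sketched in a few lines: sample several responses per prompt, then standardize each response's reward against the group. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of sampled responses.

    No value model (critic) is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, correct responses receive advantages close to +1 and incorrect ones close to -1. If every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.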

Sampling Strategies

Efficient sampling is crucial for RL stability and performance:

Dynamic Sampling: Online filtering, curriculum learning, and prioritized sampling to focus compute on informative or under-mastered examples.
Structured Sampling: Tree-based rollouts, shared prefixes, and segment-wise sampling for efficient credit assignment and compute reuse.
Hyperparameter Tuning: Careful management of temperature, entropy, and sequence length budgets to balance exploration, efficiency, and cost.
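
One simple form of online filtering can be sketched as follows: drop prompts whose sampled responses all received the same reward, since under group-normalized advantages they carry no learning signal. The function name and the dictionary layout are hypothetical, and real pipelines typically resample to refill the batch after filtering.

```python
from typing import Dict, List


def filter_informative_prompts(group_rewards: Dict[str, List[float]]) -> List[str]:
    """Keep prompts whose rollouts do not all share the same reward.

    All-correct prompts are already mastered and all-wrong prompts are
    currently out of reach; both yield zero group-normalized advantage.
    """
    return [prompt for prompt, rewards in group_rewards.items() if len(set(rewards)) > 1]
```

With rewards `{"easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0], "medium": [1, 0, 0, 1]}`, only `"medium"` is kept.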

Foundational Problems

The survey identifies and analyzes several open problems:

Sharpening vs. Discovery: Debate over whether RL merely sharpens latent capabilities or enables genuine discovery of new reasoning patterns. Recent evidence suggests both phenomena can occur, depending on training duration, regularization, and model priors.
RL vs. SFT: RL tends to generalize better under distribution shift, while SFT is prone to memorization. Hybrid and unified paradigms are emerging as best practice.
Model Priors: RL responsiveness varies dramatically across model families (e.g., Qwen vs. Llama); mid-training and curriculum design can mitigate weak priors.
Training Recipes: Minimalist, reproducible recipes (e.g., dynamic sampling, decoupled clipping) are favored over complex normalization tricks; unified evaluation protocols are needed.
Reward Type: Outcome rewards are scalable but risk unfaithful reasoning; process rewards offer dense guidance but are costly to annotate. Hybrid approaches are promising.
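
A hybrid of outcome and process rewards could, for instance, be a weighted mix of the final-answer score and the average per-step score. This is only one possible design under stated assumptions: the name `hybrid_reward`, the linear mixing rule, and the default `process_weight` are illustrative and not taken from the survey.

```python
from typing import Sequence


def hybrid_reward(outcome: float, step_scores: Sequence[float], process_weight: float = 0.3) -> float:
    """Blend a sequence-level outcome reward with the mean step-level score.

    `step_scores` would come from a process reward model or step annotations;
    with no step scores the process term is treated as 0.0.
    """
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must lie in [0, 1]")
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return (1.0 - process_weight) * outcome + process_weight * process
```

Setting `process_weight` to 0 recovers a pure outcome reward, which makes it easy to compare the two regimes in the same training setup.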

Training Resources

The survey catalogs the ecosystem of RL resources:

Static Corpora: High-quality, verifiable datasets for math, code, STEM, and agentic tasks, with increasing emphasis on process traces and difficulty stratification.
Dynamic Environments: Rule-based, code-based, game-based, and model-based environments for scalable, interactive RL; essential for agentic and open-ended tasks.
RL Infrastructure: Open-source frameworks (e.g., TRL, OpenRLHF, Verl, AReaL, NeMo-RL, ROLL, slime) supporting distributed, asynchronous, and agentic RL pipelines, with growing support for multimodal and multi-agent settings.

Applications

RL for LRMs has enabled advances across multiple domains:

Coding: RLVR and agentic RL have improved competitive programming, domain-specific code generation, and repository-level software engineering.
Agentic Tasks: RL-trained agents excel in tool use, search, browsing, GUI/computer use, and deep research, with asynchronous rollouts and memory mechanisms reducing latency.
Multimodal: RL enhances both understanding (image, video, 3D) and generation (image, video) in multimodal models, with unified frameworks for cross-modal reasoning.
Multi-Agent Systems: RL enables improved collaboration, credit assignment, and policy optimization in LLM-based MAS, though efficient communication and scalability remain open challenges.
Robotics: RL for Vision-Language-Action (VLA) models addresses data scarcity and generalization, with outcome-level rewards enabling scalable training.
Medical: RL is well established for verifiable medical tasks; non-verifiable tasks require rubric-based or offline RL, with scalability and stability as ongoing challenges.

Future Directions

The survey outlines several promising research avenues:

Continual RL: Lifelong learning and adaptation to evolving tasks and data.
Memory-based RL: Structured, reusable experience repositories for agentic reasoning.
Model-based RL: Integration of world models for richer state representations and scalable reward generation.
Efficient Reasoning: Resource-rational compute allocation and adaptive halting policies.
Latent Space Reasoning: RL in continuous latent spaces for smoother, more powerful reasoning.
RL for Pre-training: RL as a scaling strategy during pre-training, not just post-training.
Diffusion-based LLMs: RL for DLLMs, with challenges in likelihood estimation and trajectory optimization.
Scientific Discovery: RL-driven LLMs for open-ended, verifiable scientific tasks.
Architecture-Algorithm Co-Design: RL as a mechanism for dynamic, hardware-aware model adaptation.

Conclusion

This survey provides a systematic synthesis of RL for LRMs, emphasizing the centrality of verifiable rewards, scalable policy optimization, and efficient sampling. The field is rapidly advancing toward more general, autonomous, and efficient reasoning models, with RL serving as both a catalyst and a bottleneck. Theoretical and practical challenges remain, particularly in reward design, generalization, and infrastructure, but the trajectory toward scalable, agentic, and multimodal reasoning is clear. Future progress will depend on unified evaluation, reproducible recipes, and the integration of RL throughout the model lifecycle—from pre-training to deployment.

Explain it Like I'm 14

What is this paper about?

This paper is a big, friendly map of a fast‑growing area in AI: teaching LLMs to think better using reinforcement learning (RL). When RL is used to push LLMs beyond simple reply-giving and into real problem solving, the authors call them large reasoning models (LRMs). The paper explains how RL has recently boosted models’ skills in things like math, coding, planning, and step‑by‑step thinking, and it organizes the field so researchers know what works, what’s hard, and where to go next.

What questions does it ask?

In simple terms, the survey looks at:

How do we turn a chatty AI into a careful, step‑by‑step reasoner?
What kinds of “rewards” help models learn to reason (for example, checking if a math answer is correct)?
Which training algorithms and sampling tricks work best?
What data, environments, and tools are needed to scale RL to very large systems?
Where is RL most useful today (coding, agents, multimodal vision-text tasks, robotics, medical) and what open problems still block progress?

How did the authors study this?

This is a survey paper. That means the authors didn’t run one big experiment; they read and organized many papers and systems, especially since famous releases like OpenAI’s o1 and DeepSeek-R1. They grouped the field into clear parts:

Reward design: how to score a model’s answers so it knows what to improve.
Policy optimization: the training methods that push the model toward higher rewards.
Sampling strategies: smart ways to pick training examples and “thinking budgets” during training.
Resources and applications: datasets, environments, infrastructure, and where RL is being used.

If RL sounds abstract, think of it like a video game: the model tries actions (writing text, solving steps), gets points (rewards) when it does well, and learns to choose better actions next time. The twist for reasoning is that “points” can come from an answer checker (like passing a unit test for code) or from a judge (another model or a rubric that scores the solution).

What did they find?

1) A simple, checkable reward can go a long way

When a task has a clear right or wrong answer—like many math questions or coding problems—models can be trained with “verifiable rewards.” For example, math correctness or passing unit tests gives a reliable score. This makes RL training stable and scalable, and it’s a big reason models like o1 and DeepSeek‑R1 became strong reasoners.
Key idea: if you can automatically check success, you can train efficiently. Tasks that are vague or subjective are harder because “what’s good” is less clear.

2) Different kinds of rewards serve different needs

The paper groups rewards into several types:

Verifiable (rule‑based): direct checks like “is the final answer correct?” or “did the code pass the test?” Great for math/code.
Generative: the model (or another model) acts as a judge, providing a score or feedback. This helps on tasks without a simple answer key, but you must avoid “reward hacking” (cheating the judge).
Dense rewards: instead of scoring only the final answer, you score parts of the process—per token, per step, or per turn—like a teacher giving partial credit and hints along the way. This guides the model’s chain‑of‑thought.
Unsupervised/self‑rewards: the model learns to set or estimate rewards by itself when no labels exist. Useful but tricky, because the model can learn to game its own rules.
Reward shaping: combining, smoothing, or restructuring rewards (for example, encouraging good format, penalizing needlessly long answers, or using multiple small rewards) to make learning easier.

Analogy: Sometimes you grade only the final test (sparse reward). Sometimes you also grade homework steps and give rubrics (dense reward). Sometimes the student makes their own practice problems and answer keys (self‑reward), but you have to check they aren’t cheating.

3) Training methods: “critic‑based” vs “critic‑free”

Critic‑based methods (like PPO) train a helper network (a “critic”) to estimate how good an action is. This can be stable but more complex.
Critic‑free methods (like REINFORCE or GRPO) skip the critic and use clever tricks (baselines, group comparisons) to keep learning stable. These have become popular in large‑scale reasoning because they are simpler to run at huge scale.
Regularization helps: keeping the trained model close to a reference model (KL penalty), encouraging healthy exploration (entropy), and balancing answer length (length penalties) stabilize training and keep outputs readable.
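
To see how two of these regularizers can enter the score, here is a small sketch that subtracts a KL penalty and a length penalty from the task reward. Everything here is an assumption for illustration: the function names, the coefficients, and the simple per-sequence KL estimate (the sum of log-probability differences on the sampled tokens). Entropy bonuses are left out to keep it short.

```python
from typing import Sequence


def sequence_kl_estimate(policy_logprobs: Sequence[float], reference_logprobs: Sequence[float]) -> float:
    """Single-sample estimate of KL(policy || reference) for one response."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def shaped_reward(
    task_reward: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    kl_coef: float = 0.05,
    length_budget: int = 512,
    length_coef: float = 0.001,
) -> float:
    """Task reward minus a KL penalty and a penalty for tokens over budget."""
    kl = sequence_kl_estimate(policy_logprobs, reference_logprobs)
    overflow = max(0, len(policy_logprobs) - length_budget)
    return task_reward - kl_coef * kl - length_coef * overflow
```

The KL term pulls the trained model back toward the reference model, and the length term only kicks in once a response runs past its token budget.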

4) Sampling and “thinking time” matter

Giving the model more “thinking time” at test time (letting it generate more intermediate steps) often improves results. This is like telling a student, “Take your time, show your work.”
During training, dynamic sampling (choosing harder/easier problems at the right times, adjusting how many samples you draw, or changing temperatures) improves efficiency and skill growth.

5) Resources and applications are expanding fast

Resources: The field now has static datasets (e.g., math/coding problems), dynamic environments (tools, simulators, websites), and training infrastructure for massive online RL runs.
Applications:
- Coding (fixing bugs, passing benchmarks)
- Agentic tasks (using tools, browsing, planning multi‑step actions)
- Multimodal reasoning (text + images/video/audio)
- Multi‑agent systems (AIs coordinating with each other)
- Robotics and medical use cases (more controlled and safety‑critical)

6) Important open questions

RL vs supervised fine‑tuning (SFT): When should you use RL (trial‑and‑error with rewards) instead of SFT (learning from examples)? Many strong systems use both at different stages.
Reward definitions: How to create reliable rewards for tasks without a clear answer key?
Training recipes: What are the best mixes of data, rewards, regularization, and compute budgets?
Scaling: Beyond just more data and bigger models, how do we scale “train‑time RL” and “test‑time thinking” efficiently and safely?

Why does this matter?

Clear takeaways:

RL gives models a new way to grow: not just by reading more text, but by practicing tasks with feedback, like a student doing graded exercises.
For tasks with automatic checks (math, code), RL has already unlocked big reasoning gains. That’s why so many recent “thinking” model breakthroughs focus there first.
As rewards get better (denser, fairer, harder to cheat) and training tools improve, we can teach models to reason across more real‑world tasks.
The long‑term vision is that RL could keep scaling models’ reasoning abilities, helping them plan, reflect, and self‑correct—key steps toward truly helpful, trustworthy AI assistants.

In short, this survey shows how reinforcement learning has become a core ingredient for making LLMs better thinkers, maps out the methods that work, and highlights the challenges we must solve to push reasoning AI even further.



Authors (39)


GitHub

GitHub - TsinghuaC3I/Awesome-RL-for-LRMs: Awesome Reinforcement Learning for Large Reasoning Models (RL4LRM) (827 stars)
